Using Program and User Information to Improve File Prediction Performance

Correct prediction of file accesses can improve system performance by mitigating the relative speed difference between CPU and disks. This paper discusses Program-based Last Successor (PLS) and presents Program- and User-based Last Successor (PULS), file prediction algorithms that utilize information about the program and user that access the files. Our simulation results show that PLS makes 21% fewer incorrect predictions and PULS makes 24% fewer incorrect predictions than last-successor, with roughly the same number of correct predictions that last-successor makes. The cache space wasted on incorrect predictions can be reduced accordingly. We also show that a cache using the Least Recently Used (LRU) caching algorithm can perform better when PULS is applied. In some cases, a cache using LRU and either PLS or PULS performs better than a cache up to 40 times larger using LRU alone.


Introduction
As disks operate significantly slower than CPUs, prefetching files into cache memory before they are used remains a promising way to mitigate the speed difference between them. Probability and history of file access have been widely used to perform file prediction [10,11,13,4,5,12,16], as have hints or help from programs and compilers [17,3].
While correct file prediction is useful, incorrect prediction is to a certain degree unavoidable. Incorrect prediction not only wastes cache space and disk bandwidth, it also prolongs the time required to bring needed data into the cache if a cache miss occurs while the incorrectly predicted data is being transferred from the disk. Consequently incorrect predictions can lower the overall performance of the system regardless of the accuracy of correct prediction.
Files are accessed by programs, and programs are executed on behalf of users. This suggests that consecutive accesses of different files do not occur without reason. We contend the reason is that programs access more or less the same set of files in roughly the same order every time they execute, especially when they are executed by the same user. Therefore consecutive accesses of different files can be more accurately predicted given knowledge about which programs and users are accessing them.
We have developed two file prediction algorithms that utilize this information to generate more accurate file predictions: Program-based Last Successor (PLS) [20] uses program information, while Program- and User-based Last Successor (PULS) uses both program and user information. Our results demonstrate that both PLS and PULS generate more accurate file predictions than the other file prediction algorithms examined, with PULS performing somewhat better. In particular, compared with LS, they reduce the number of incorrect file predictions while maintaining roughly the same number of correct predictions, providing better overall file prediction and therefore better overall system performance.
We compare PLS and PULS with Last-Successor (LS) and Finite Multi-Order Context (FMOC) [10]. LS can correctly predict the next file to be accessed close to 80% of the time in our experiments. FMOC outperformed LS in a one-month trace in Kroeger's study [10], but it performs slightly worse than LS in our simulations. Our experiments demonstrate that with traces covering as long as 13 months PLS makes about 21% fewer incorrect predictions and PULS makes about 24% fewer incorrect predictions than LS, giving PULS the highest predictive accuracy among all four models in our comparison. We then use a synthesized trace to demonstrate that the advantage of PULS over LS grows when there are more users in the system.
We also examine the cache hit ratios of Least Recently Used (LRU) with no file prediction, and LRU with PULS. We observe that PULS always increases the cache hit ratio and, in the best case, LRU and PULS together have a better cache hit ratio than a cache 40 times larger using LRU alone.

Related Work
Most probability-based prediction algorithms use the history of system-wide file accesses and, unlike PULS, do not take advantage of the corresponding program or user information.
Griffioen and Appleton use probability graphs to predict future file accesses [5]. The graph tracks file accesses observed within a certain window after the current access. For each file access, the probability of its different followers observed within the window is used to make prefetching decisions. Their simulations show that different combinations of window and threshold values can strongly affect performance.
Kroeger and Long predict the next file based on the probability of files in the contexts of FMOC [10]. Their research also adopts ideas from data compression, as do Vitter et al. [19], but they apply them to predicting the next file instead of the next page.
Lei and Duchamp use pattern trees to record the past execution activities of each program [13]. They maintain a separate pattern tree for each access pattern observed. A program could require multiple pattern trees to store similar patterns of file accesses from its previous executions, which forces the system to keep duplicated information. Pattern trees of a running program are compared with the current access pattern. If a match is found, files in that pattern tree are prefetched into memory. One of the main differences between their algorithm and PULS is that PULS makes the prediction decision for each individual file, so it can adapt to different patterns of file access more rapidly.
Vitter, Curewitz, and Krishnan adopt the technique of data compression to predict the next required page [4,19]. Their observation is that data compressors assign a smaller code to the next character with a higher predicted probability. Consequently a good data compression algorithm should also be good at predicting the next page.
Patterson et al. develop TIP to do prediction using hints provided by modified compilers [17]. Accordingly, resources can be managed and allocated more efficiently. Extra coding in programs and language dependence are disadvantages of this type of approach. Without access to source code there is no way to generate hints. Hints generated statically by compilers may also be of little use if file accesses cannot be decided until runtime.
Chang and Gibson design a tool that transforms UNIX application binaries to perform speculative execution and issue hints [3]. Their algorithm eliminates the issue of language dependence, but it can only be applied to single-threaded applications.
Mowry et al. use a modified compiler to provide future access patterns for out-of-core applications [14]. Kotz and Ellis define representative parallel file access patterns in parallel disk systems [9]. Cao et al. define four properties that an optimal prediction and caching model should satisfy [2]. Palmer and Zdonik use unit patterns to prefetch data in database applications [16]. Kimbrel et al. examine four related algorithms to find out when a prefetching algorithm should act aggressively or conservatively [7].
Prefetching data between different levels of cache, such as moving data from the off-chip cache to the on-chip cache before the processor needs it, can also reduce the latency of memory operations [6].
Probability-based prediction algorithms, in general, respond to changes of reference pattern more dynamically than those relying on help from compilers and applications. However, over a longer period of time, accumulated probability may not closely reflect the latest access pattern and may sometimes even mislead prediction algorithms.

Predicting Models Compared
We briefly discuss the prediction models compared in our study. We start with LS, which is a common benchmark for comparing prediction models. We then explain FMOC, which outperformed LS in a previous study, followed by a discussion of PLS and PULS.

Last Successor (LS)
LS is a common benchmark used in comparing different schemes of file prediction. For each file accessed, LS predicts that the file which followed the most recent previous access to the current file will follow it again. The metadata kept in this model is simple: only one last successor is needed for each file. However, scheduling the same set of programs in different orders may generate totally different file access patterns. This means the performance of LS could vary dramatically even when the same set of programs is executed repeatedly.
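The LS rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the simulator used in the paper; the names `LastSuccessor` and `replay` are hypothetical, and the trace is assumed to be a plain sequence of file names.

```python
from typing import Dict, Iterable, Optional, Tuple


class LastSuccessor:
    """Keeps one remembered successor per file (hypothetical class name)."""

    def __init__(self) -> None:
        self.successor: Dict[str, str] = {}   # file -> file that followed it last time
        self.previous: Optional[str] = None   # most recent file seen system-wide

    def access(self, file: str) -> Optional[str]:
        """Record an access; return the predicted next file, or None."""
        if self.previous is not None:
            self.successor[self.previous] = file
        self.previous = file
        return self.successor.get(file)


def replay(trace: Iterable[str]) -> Tuple[int, int, int]:
    """Return (correct, incorrect, no_prediction) counts over a trace."""
    model = LastSuccessor()
    correct = incorrect = no_prediction = 0
    prediction: Optional[str] = None
    first = True
    for file in trace:
        if not first:  # score the prediction made after the previous access
            if prediction is None:
                no_prediction += 1
            elif prediction == file:
                correct += 1
            else:
                incorrect += 1
        prediction = model.access(file)
        first = False
    return correct, incorrect, no_prediction
```

Because the table is keyed only by file, any change in how programs are interleaved rewrites the stored successors, which is the sensitivity to scheduling noted above.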

Finite Multi-Order Context (FMOC)
FMOC predicts the next file to be accessed from the files that have been seen so far in "context" [10]. Each file seen in a context has a probability indicating the likelihood that it follows that context. FMOC often prefetches multiple files for each prediction. The "additive accuracy" was defined to compare the performance between FMOC and LS [10]. If the next file accessed is among those files prefetched, then the predicted probability of that file is added to the score of FMOC. The final score is then normalized by the number of events in the simulation trace to obtain the "additive accuracy" [10]. The additive accuracy therefore indicates the likelihood that the next file actually referenced is among those predicted files.
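One way to write the additive accuracy described above as a formula is given below. The notation is an assumption made here for illustration and is not taken from [10]: $N$ is the number of events in the trace, $S_t$ is the set of files prefetched after event $t$, $f_{t+1}$ is the file actually accessed next, and $p_t(f)$ is the probability the model assigned to file $f$.

```latex
\[
\text{additive accuracy}
  \;=\; \frac{1}{N} \sum_{t=1}^{N}
        \mathbf{1}\!\left[\, f_{t+1} \in S_t \,\right] \, p_t(f_{t+1})
\]
```

Each event contributes the predicted probability of the next file if that file was prefetched and contributes nothing otherwise, and the sum is divided by the number of events rather than by the number of predictions.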
Kroeger's study showed that using order higher than two resulted in negligible improvements, so in this work we only examine the second-order FMOC model (denoted as FMOC2).

Program-based Last Successor (PLS)
Lacking a priori knowledge of file access patterns, many file prediction algorithms use statistical analysis of past file access patterns to generate predictions about future access patterns. One problem with this approach is that executing the same set of programs can produce different file access patterns even if the individual programs always access the same files in the same order. For example, consider a system with a preemptive scheduler running two programs, $P_1$ and $P_2$, where $P_1$ accesses files $A$, $B$, and $C$, in that order, and $P_2$ accesses files $D$, $E$, and $F$, in that order, and each file is accessed exactly once. While each program has a perfectly predictable access pattern and each file (after the first one in each sequence) follows exactly one other file in the program-based sequence, the system will see one of 20 different file access patterns ($\binom{6}{3} = 20$) depending on the exact timing of context switches in the system. In particular, with repeated executions of these two programs the history of file accesses observed by the system will vary considerably.
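The count of 20 patterns can be checked by enumerating every way to interleave two three-file sequences while preserving each program's own order. The sketch below is illustrative only; the file names `A` to `F` and the helper names are placeholders chosen here.

```python
from itertools import combinations
from math import comb


def interleavings(first, second):
    """Yield every merge of two sequences that keeps each one's internal order."""
    n = len(first) + len(second)
    for slots in combinations(range(n), len(first)):
        chosen = set(slots)                 # positions taken by the first program
        a, b = iter(first), iter(second)
        yield tuple(next(a) if i in chosen else next(b) for i in range(n))


def successors_of(target, patterns):
    """Collect every file that directly follows `target` in any pattern."""
    return {p[i + 1] for p in patterns for i in range(len(p) - 1) if p[i] == target}


patterns = set(interleavings("ABC", "DEF"))
```

Here `len(patterns)` equals `comb(6, 3)`, and `successors_of("A", patterns)` contains four different files even though the program that reads `A` always reads `B` next.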
Suppose a file trace at some time shows pattern $AB$ and pattern $AC$ occurring 60% and 40% of the time respectively. A probability-based prediction will prefer predicting $B$ after $A$ is accessed. If $B$ and $C$ tend to alternate after $A$, then LS will do especially poorly. But the reasons that patterns $AB$ and $AC$ occur may be quite different. For instance, in Figure 1, the file access pattern $AB$ is seen to be caused by program $P_1$, while the file access pattern $AC$ is caused by program $P_2$. In other words, what is really behind the numbers 60% and 40% is the execution of two different applications, $P_1$ and $P_2$. After we collect this information (a set of pairs consisting of "program name" and "successor") for file $A$, the next time it is accessed we can predict either $B$ or $C$ depending on whether $P_1$ or $P_2$ is accessing $A$, or provide no prediction if $A$ is accessed by another program. Of course, if a particular program accesses multiple different files after each access of a particular file, then the program-specific last successor will change. The metadata of files in Figure 1 is shown in Table 1.
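A minimal sketch of the per-file set of (program name, successor) pairs follows. It assumes each trace event carries a program name and a file name; the class name `ProgramLastSuccessor` and its fields are hypothetical and only illustrate the idea.

```python
from typing import Dict, Optional


class ProgramLastSuccessor:
    """Per-file table of program-specific last successors (illustrative only)."""

    def __init__(self) -> None:
        # file -> {program name: file that program accessed right after it}
        self.meta: Dict[str, Dict[str, str]] = {}
        # program name -> file that program accessed most recently
        self.last_file: Dict[str, str] = {}

    def access(self, program: str, file: str) -> Optional[str]:
        """Record that `program` accessed `file`; return its predicted next file."""
        prev = self.last_file.get(program)
        if prev is not None:
            self.meta.setdefault(prev, {})[program] = file
        self.last_file[program] = file
        # No pair stored for this program means no prediction is made.
        return self.meta.get(file, {}).get(program)
```

With the two programs interleaved, an access to `A` by `P1` yields `B` and an access to `A` by `P2` yields `C`, while an access by a program not yet seen yields `None`.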
PLS does not suffer from the incorrect predictions that LS makes when executing the same set of programs generates different file access patterns, as discussed earlier. However, there are cases where different executions of the same program will likely involve different sequences of file accesses. Executing system programs such as editors and compilers often falls into this category, where PLS could make fewer correct predictions. This is because different users usually edit and compile their own files. Consequently, if we know which user is editing or compiling the files, we can make more accurate predictions in this case. Database applications could also involve different sequences of file access depending on which user is executing the program.

Program- and User-based Last Successor (PULS)
PULS is a refinement of PLS. PULS takes user information into account when it makes predictions. As stated earlier, knowing which user is executing a particular program can make predictions more accurate than those of PLS in certain cases. In this section we discuss how to implement PULS and explain why it can perform better than LS and PLS.
Probability can only tell us what the patterns of file access are and how frequently they occur, but not why these patterns exist. Files are accessed by programs, and programs are executed on behalf of users. Consequently, patterns of file access are largely decided by who is running what program in the system.
Figure 2 shows a case where PULS can outperform PLS.
Suppose file $B$ is accessed after an access to file $A$ when the user $U_1$ runs the program $P_1$, while an access to $A$ will be followed by an access to $D$ when $P_1$ is executed by $U_2$.
Meanwhile, when $U_1$ executes $P_2$, an access to $A$ is followed by an access to $C$, while an access to $E$ will follow an access to $A$ if $P_2$ is executed by $U_3$ instead. In this case, PLS will make fewer correct predictions if $U_1$ and $U_2$ tend to execute $P_1$ alternately. Similarly, $U_1$ and $U_3$ will face the same problem when they execute $P_2$. The situation can get worse as more users execute the same programs in a system. However, no matter how many users execute the same programs, PULS can still make correct predictions in cases similar to Figure 2. This is because PULS collects the information (a set of pairs consisting of "program name" and "user-successor") for file $A$; the next time it is accessed we can predict either $B$, $C$, $D$, or $E$ depending on which user ($U_1$, $U_2$, or $U_3$) is running $P_1$ or $P_2$, or provide no prediction if $A$ is accessed by another program or user. Of course, if a particular program accesses multiple different files after each access of a particular file, then the program- and user-specific last successor will change.
PLS can predict as well as PULS only in a single-user system, or in a system where no users share any programs. The advantage of PULS over PLS will increase as more users execute the same programs. Since machines in real systems often host multiple users, PULS can therefore predict more accurately than PLS.

Table 2. Metadata of Figure 2 kept under the PULS model
File | $\langle$program name, user-successor$\rangle$

PLS and PULS can avoid the slow adaptation problem in probability-based prediction models. Probability-based models always predict the same file until the corresponding probability changes. Like LS, neither PLS nor PULS relies on probability, so they can respond immediately as patterns of file access change.
There are three issues that need to be addressed. The first issue is how to collect the metadata in terms of $\langle$program name, user-successor$\rangle$ for each file. Programs are executed as processes, so we can just store the program name and user ID (uid) in the process control block (PCB). For each running program (say $P$) executed by a user (say $U$), we also need to keep track of the file (say $X$) which it has most recently accessed. When $P$ accesses the next file (say $Y$) after $X$, we update the metadata of $X$ with $\langle P, U\text{-}Y \rangle$, and the next time that $P$ accesses $X$ on behalf of $U$, PULS can predict that the next file accessed will be $Y$.
In the example of Figure 2, when $P_1$ (say executed by $U_1$) accesses the next file (say $B$) after its access to $A$, we update the metadata of $A$ with $\langle P_1, U_1\text{-}B \rangle$, and the next time $P_1$ accesses $A$ on behalf of $U_1$, PULS can predict that the next file accessed will be $B$. Similarly, $A$ also keeps $\langle P_1, U_2\text{-}D \rangle$ as part of its metadata. The metadata of files in Figure 2 is shown in Table 2.
The second issue is how big the metadata needs to be in order to make accurate predictions, which is not quite as simple as the first. Ideally, for each file we would like to record the name of every program that has accessed it before, along with the program- and user-specific successor to the file, so that we know which file to predict when the same program executed by the same user accesses the file again. In reality, this may be too expensive for files used by many different programs. Consequently, we may need to limit the number of $\langle$program name, user-successor$\rangle$ pairs kept for each file. However, our simulation shows that the vast majority of files are accessed by six or fewer programs, and thus metadata storage is not a problem.
The last issue is that if a program (say $P$) eventually executes another program (say $Q$), the information of $Q$ is also added to the metadata of $P$, and it will be predicted accordingly in the future.
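The metadata update and lookup described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the class name `PULS`, the use of a process ID to stand in for the per-process state kept in the PCB, the `max_pairs` limit, and the policy of dropping the least recently updated pair are all assumptions made here.

```python
from collections import OrderedDict
from typing import Dict, Optional, Tuple

Key = Tuple[str, int]  # (program path, uid)


class PULS:
    """Per-file table of (program, user)-specific last successors (sketch)."""

    def __init__(self, max_pairs: int = 6) -> None:
        # Assumed cap on pairs per file; the eviction policy is also an assumption.
        self.max_pairs = max_pairs
        # file -> ordered {(program, uid): successor}, oldest update first
        self.meta: Dict[str, "OrderedDict[Key, str]"] = {}
        # pid -> file that process accessed most recently (stands in for PCB state)
        self.last_file: Dict[int, str] = {}

    def access(self, pid: int, program: str, uid: int, file: str) -> Optional[str]:
        """Record an access by process `pid`; return the predicted next file."""
        key = (program, uid)
        prev = self.last_file.get(pid)
        if prev is not None:
            pairs = self.meta.setdefault(prev, OrderedDict())
            pairs[key] = file
            pairs.move_to_end(key)
            while len(pairs) > self.max_pairs:
                pairs.popitem(last=False)   # drop the least recently updated pair
        self.last_file[pid] = file
        return self.meta.get(file, {}).get(key)
```

A new process running the same program for the same user starts with no last file of its own but still benefits from pairs recorded by earlier processes, because the per-file table is keyed by program and user rather than by process.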
A few terms need to be clarified here. The first is that when we use the term "program" we mean any running executable file. Thus a driver program that launches different sub-programs at different times is considered by PULS to be a different program from the sub-programs, each of which is also treated independently. The second is that both "program name" and "file name" include the entire pathname of the files. This is important because different programs with the same name can access the same file and different files with the same name can be accessed by different programs, and these accesses must all be handled correctly.

Evaluation
In this section, we discuss the trace data we used to conduct our experiments, explain why we chose the particular trace for our simulations, and finally describe how we compare the performance of FMOC2, LS, PLS, and PULS.

Simulation Trace
The key requirement for the file trace is that it records the corresponding program and user for each file access event. User information is available in some of the traces we studied. However, program information cannot be obtained, either directly from the traces or indirectly by reprocessing the data, in any of the file traces we have access to except DFSTrace from the Coda project [8,15]. Therefore we selected DFSTrace to evaluate the performance differences among the models we compared.
File traces in DFSTrace were collected from 33 machines during the period between February of 1991 and March of 1993. We used data covering between 7 and 13 months from four machines: Barber, Mozart, Dvorak, and Ives. Barber was a server, Mozart was a desktop workstation, Dvorak had the highest percentage of writes, and Ives hosted the most users. The periods of data selected from Barber, Mozart, Dvorak, and Ives are 11, 13, 7, and 7 months respectively. Research has demonstrated that the average life of a file is very short [1]. Besides, DFSTrace does not contain READ or WRITE events most of the time. Therefore, instead of tracking every READ or WRITE event, we track only the FORK, EXECVE and OPEN events in our simulation.
As mentioned above, PULS needs to be able to determine the name of a program and the user in order to generate its predictions. Because we cannot obtain the name of any program or user that started executing before the beginning of the trace, we exclude EXECVE events from processes whose user IDs (uid) are unknown. This is because DFSTrace only reports the uid of the child process in the FORK event. By catching the uid of the new child process, we have the uid we need for all the following EXECVE and OPEN events from that child process. We also exclude OPEN events initiated by any process ID (pid) which started before the beginning of our trace. Intuitively this filtering has no effect on the results of our experiments because the filtering is based only on the time at which the program began. In a real system such filtering is not necessary because all program names and user names are known.
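The filtering rule above can be sketched as a single pass over the events. The record layout below (`Event`, `Access`) is hypothetical and is not the DFSTrace format. The sketch also assumes that a forked child keeps its parent's program name until it issues its own EXECVE, and it ignores pid reuse.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Optional


class Event(NamedTuple):
    kind: str                          # "FORK", "EXECVE", or "OPEN"
    pid: int                           # process issuing the event
    path: Optional[str] = None         # program path (EXECVE) or file path (OPEN)
    child_pid: Optional[int] = None    # FORK only
    child_uid: Optional[int] = None    # FORK only


class Access(NamedTuple):
    program: str
    uid: int
    path: str


def filter_trace(events: Iterable[Event]) -> Iterator[Access]:
    """Keep only OPEN events whose program name and uid are both known."""
    uid_of: Dict[int, int] = {}        # pid -> uid learned from a FORK event
    program_of: Dict[int, str] = {}    # pid -> program path learned from EXECVE
    for ev in events:
        if ev.kind == "FORK":
            uid_of[ev.child_pid] = ev.child_uid
            if ev.pid in program_of:   # assumption: child inherits parent's program
                program_of[ev.child_pid] = program_of[ev.pid]
        elif ev.kind == "EXECVE":
            if ev.pid in uid_of:       # drop EXECVE when the uid is unknown
                program_of[ev.pid] = ev.path
        elif ev.kind == "OPEN":
            if ev.pid in uid_of and ev.pid in program_of:
                yield Access(program_of[ev.pid], uid_of[ev.pid], ev.path)
```

Processes that were already running when the trace began never appear as the child of a FORK, so their uid stays unknown and their events are dropped.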

Methodology of Performance Evaluation
We score PULS, PLS and LS in the same way by adding one for each correct prediction and zero for each incorrect prediction. We normalize the final scores of these three models by the number of predictions, not by the number of events as in the FMOC2 model. This is because the first time that a file is accessed by a program there is no previous successor to predict, and so the failure to make a prediction the first time cannot be considered incorrect. Since our simulation trace is very long (between 7 and 13 months), it turns out that the effect of this compulsory error is negligible and does not affect the comparison of predictive accuracy among these models. The final score (as a percentage) is referred to as "predictive accuracy" in our experiment. So a predictive accuracy of $x$% means that on average there are $x$ correct predictions out of 100 predictions. As explained earlier, FMOC2 tends to predict multiple files at a time, so the score of FMOC2 is the "additive accuracy", which can be viewed as equivalent to the "predictive accuracy" used in our experiments. Meanwhile, for models predicting only one file at a time, such as the other three models compared, predictive accuracy is the same as additive accuracy.

Comparison of Predictive Accuracy
We used the filtered trace data to evaluate FMOC2, LS, PLS, and PULS. Figure 3 shows that PULS has the highest predictive accuracy on all machines. One pitfall in comparing prediction models in terms of predictive accuracy is that higher predictive accuracy does not assure the success of a model, because the scores are usually normalized by the number of predictions made, which does not include those cases where no prediction was made. Consider two prediction models, $A$ and $B$. If $A$ makes 40 correct predictions, 40 incorrect predictions, and does not make a prediction 20 times out of a total of 100 file accesses, then $A$'s predictive accuracy is 50%. Suppose $B$ makes only 2 correct predictions, 1 incorrect prediction, and does not make a prediction 97 times. $B$'s predictive accuracy is 67%, but model $B$ is almost useless in practice.
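The arithmetic behind this example is shown below. The function name `summarize` and the term "coverage" (the share of accesses for which any prediction is made) are introduced here for illustration and do not appear in the paper.

```python
from typing import Tuple


def summarize(correct: int, incorrect: int, no_prediction: int) -> Tuple[float, float]:
    """Return (predictive accuracy %, coverage %) for one model."""
    predictions = correct + incorrect
    accesses = predictions + no_prediction
    accuracy = 100.0 * correct / predictions   # normalized by predictions made
    coverage = 100.0 * predictions / accesses  # normalized by all accesses
    return accuracy, coverage


model_a = summarize(correct=40, incorrect=40, no_prediction=20)
model_b = summarize(correct=2, incorrect=1, no_prediction=97)
```

The first model predicts on 80% of accesses with 50% accuracy, while the second reaches roughly 67% accuracy but predicts on only 3% of accesses, which is why accuracy alone is not enough to rank models.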
Clearly, in order to examine the real performance of a prediction model, we need other information besides predictive accuracy. Thus, we use LS as the baseline to evaluate the detailed performance of the other models in three categories. The first category is the percentage of total predictions (including correct and incorrect predictions) made by PULS as compared with LS. This percentage should not be too small; otherwise PULS may be an unrealistic model just like model $B$ above. The second is the percentage of correct predictions made by PULS as compared with LS. This number should be as high as possible. The last category is the percentage of incorrect predictions made by PULS as compared with LS. Ideally this percentage should be less than 100%, indicating that PULS makes fewer incorrect predictions than LS.
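The three categories can be computed from raw counts as in the sketch below. The names `Counts` and `relative_to_baseline` are hypothetical, and the numbers in the usage lines are made up for illustration; they are not taken from the paper's figures.

```python
from typing import Dict, NamedTuple


class Counts(NamedTuple):
    correct: int
    incorrect: int


def relative_to_baseline(model: Counts, baseline: Counts) -> Dict[str, float]:
    """Express a model's prediction counts as percentages of the baseline's."""
    def pct(value: int, reference: int) -> float:
        return 100.0 * value / reference

    return {
        "total": pct(model.correct + model.incorrect,
                     baseline.correct + baseline.incorrect),
        "correct": pct(model.correct, baseline.correct),
        "incorrect": pct(model.incorrect, baseline.incorrect),
    }


# Made-up counts, used only to show the calculation.
ls_counts = Counts(correct=800, incorrect=200)
other_counts = Counts(correct=790, incorrect=150)
result = relative_to_baseline(other_counts, ls_counts)
```

A good model keeps "total" and "correct" close to 100% while pushing "incorrect" well below 100%.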

Performance by Category
We cannot do the same comparison with FMOC2 due to the nature of FMOC discussed above. Figure 4 displays the performance in the category of total predictions. It shows that the percentage of events where a prediction was made by PULS is only about ten percent less than that of LS. This is close enough to consider PULS a practical prediction algorithm in terms of the number of predictions it makes. The percentage of correct predictions is shown in Figure 5. For Barber and Ives, PULS makes over 98% as many correct predictions as LS, and for Mozart over 99%. For Dvorak, PULS makes more correct predictions than LS. Figure 5 demonstrates that both PULS and PLS can do roughly as well as LS in correctly predicting files. Figure 6 shows the percentage of incorrect predictions. To get a closer look at the relative performance of LS, PLS, and PULS in terms of reducing incorrect predictions, the data in Figure 6 is normalized to LS and displayed in Figure 7. Figure 7 clearly shows that PULS makes fewer incorrect predictions than both LS and PLS. PULS makes about 24% fewer incorrect predictions than LS. In some cases PULS also cuts about 3.5% more incorrect predictions than PLS does. This explains why PULS has the highest predictive accuracy among LS, PLS, and PULS in Figure 3. As we discussed before, incorrect predictions come with a cost, and avoiding this cost directly translates into better system performance.
The reduction of incorrect predictions in PULS is interesting enough to be worthy of further exploration. Since the number of predictions made by PULS is only about ten percent less than LS and one percent less than PLS, and the numbers of correct predictions of these three are roughly the same, we conclude that PULS declines to make a prediction more often than both LS and PLS. We collected the percentages of cases where no prediction was made by PULS and PLS and compared them with LS. The results, displayed in Figure 8, confirm this conclusion. Figure 8 shows that the percentage of events where no prediction was made by PULS is roughly three to seven times higher than that of LS, and about 1.3 to 1.7 times higher than that of PLS.

Performance Increase for Multi-Users
Events in each of the four traces are generated by only a small number of users. We believe the advantage of PULS over LS grows as the number of users in the system goes up. To simulate a system that hosts more users, we synthesized a one-month trace by combining events within the same month from each of the four traces. Because we count a prediction as correct only if the predicted file is the next file needed by the entire system, not by each individual program, both LS and PULS are expected to produce lower predictive accuracy on the synthesized trace. We calculate the "total-weighted" predictive accuracy by summing the four "individual-weighted" predictive accuracies, each of which is the product of the predictive accuracy of one trace and the percentage of events in the synthesized trace that come from that particular trace. Table 3 and Table 4 show the results for PULS and LS respectively. Take Table 3 for example: the "individual-weighted" predictive accuracy of Barber is the original predictive accuracy for Barber multiplied by the percentage of the total events in the synthesized trace that come from Barber. The "total-weighted" predictive accuracy of PULS is the sum of the "individual-weighted" predictive accuracies of the four traces. The "total-weighted" predictive accuracy can practically be viewed as the upper bound of the predictive accuracy that PULS can generate from the synthesized trace. Table 5 compares the practical upper bound of predictive accuracy that PULS and LS can achieve with the actual predictive accuracy both generate. The results show that PULS reaches a larger share of its practical limit than LS does (see Table 5). This demonstrates that the advantage of PULS over LS could grow when there are more users in the system. One last note about this evaluation is that the synthesized trace we produced is only one month long. We expect the performance gain of PULS over LS to be more noticeable when a longer trace is applied.
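The synthesis and weighting steps described above can be sketched as follows. This is an illustration under stated assumptions: each trace is taken to be a list of `(timestamp, event)` tuples already sorted by time, the function names are hypothetical, and the trace names and accuracy values used with it are toy inputs rather than results from the paper.

```python
import heapq
from collections import Counter
from typing import Dict, List, Tuple

Tagged = Tuple[int, str, str]  # (timestamp, source trace name, event)


def synthesize(traces: Dict[str, List[Tuple[int, str]]]) -> List[Tagged]:
    """Merge time-sorted traces into one time-ordered trace, tagging the source."""
    tagged = ([(ts, name, ev) for ts, ev in events]
              for name, events in traces.items())
    return list(heapq.merge(*tagged))


def total_weighted_accuracy(accuracy: Dict[str, float],
                            merged: List[Tagged]) -> float:
    """Sum each trace's accuracy weighted by its share of the merged events."""
    share = Counter(name for _, name, _ in merged)
    n = len(merged)
    return sum(accuracy[name] * count / n for name, count in share.items())
```

The ratio of the accuracy actually measured on the merged trace to this weighted sum is the "share of the practical limit" that the comparison between PULS and LS relies on.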

Cache Performance
In addition to predictive accuracy we also want to know how PLS and PULS perform in terms of cache hit ratio. We set the cache size according to the number of files it can hold for two reasons. The first is that files are usually small, so the entire file can often be prefetched into the cache [18]. The second is that in the case of large files, sequential [...] [21], so it is an appropriate candidate used to evaluate the effectiveness of prediction algorithms in terms of cache hit ratio. Figure 9 shows that when using PULS prediction, the cache always performs better than when using LRU alone, regardless of cache size, and in some cases even better than a cache up to 40 times larger. For this data, because the records in each trace are created by only a small number of users, the cache performance of PLS and PULS is essentially the same.
Part of the reason for the dramatic performance improvement of LRU with PULS is the fact that an incorrect prediction made by PULS, one that does not correctly predict the next file to be accessed, will still provide a benefit if the file is subsequently accessed while it is still in the cache. Because PULS makes program- and user-based predictions, its incorrect predictions are much more likely to be for a file to be accessed in the near future than are predictions made by non-program-based models, which may predict a file accessed by a program that is no longer even running. Meanwhile, a file incorrectly predicted for one program executed on behalf of a particular user is more likely to be accessed by other programs the same user is running than a file incorrectly predicted for programs run by others. In other words, the incorrect predictions by PULS are more likely to be used in the near future and are therefore less wrong than those made by other models. The earlier graphs displaying predictive accuracy show performance for an effective cache size of one file and therefore do not show the performance benefit of this second-chance effect, but Figure 9 clearly shows this effect. In real systems where multiple files can fit in memory at once, the performance will benefit accordingly.
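The second-chance effect can be reproduced with a small whole-file LRU simulation. The sketch below is not the simulator used for Figure 9. It assumes a cache that holds a fixed number of files, that a prefetched file is inserted at the most recently used position, and that the predictor is a simple per-key last-successor table where the key stands for a (program, user) pair; all function names are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Dict, Hashable, List, Optional, Tuple

Predictor = Callable[[Hashable, str], Optional[str]]


def make_predictor() -> Predictor:
    """Build a per-key last-successor predictor (key = program and user)."""
    meta: Dict[Tuple[Hashable, str], str] = {}
    last: Dict[Hashable, str] = {}

    def predict(key: Hashable, file: str) -> Optional[str]:
        prev = last.get(key)
        if prev is not None:
            meta[(key, prev)] = file
        last[key] = file
        return meta.get((key, file))

    return predict


def hit_ratio(accesses: List[Tuple[Hashable, str]], capacity: int,
              predictor: Optional[Predictor] = None) -> float:
    """Replay accesses through an LRU cache of `capacity` files."""
    cache: "OrderedDict[str, bool]" = OrderedDict()

    def touch(file: str) -> None:
        cache[file] = True
        cache.move_to_end(file)            # most recently used at the end
        while len(cache) > capacity:
            cache.popitem(last=False)      # evict the least recently used file

    hits = 0
    for key, file in accesses:
        if file in cache:
            hits += 1
        touch(file)
        if predictor is not None:
            guess = predictor(key, file)
            if guess is not None:
                touch(guess)               # prefetch the predicted file
    return hits / len(accesses)


# Two interleaved (program, user) keys, each cycling through its own two files.
accesses = [("k1", "A"), ("k2", "C"), ("k1", "B"), ("k2", "D")] * 3
```

On this toy trace with room for three files, plain LRU never hits because the four files cycle, whereas the prefetching version hits on every access after the first round. With a capacity of one file the cache holds only the latest prediction, which matches the "effective cache size of one file" view of predictive accuracy.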

Conclusions and Future Work
As the speed gap between the CPU and secondary storage devices will not narrow in the foreseeable future, file prefetching will remain a promising way to keep programs from stalling while waiting for data from disk. Incorrect prediction can be costly in practice. Reducing the number of files incorrectly predicted is very important in terms of saving both cache space and disk bandwidth. Our simulations show that using program and user information can generate good results in predicting files, especially in eliminating cases of incorrect prediction. Therefore, both the file prediction performance and the cache hit ratio can be improved.
File accesses are driven by the users and programs using them, not by previous access patterns. By tracking the programs and users initiating file accesses, we successfully avoid many incorrect predictions. Our results demonstrate that in some cases about 24% of the incorrect predictions made by LS can be avoided. Therefore, the overall performance penalty in the system caused by incorrect predictions can be significantly reduced. We also compare the cache hit ratios of LRU with and without PULS. The results show that with PULS, LRU can deliver a much higher cache hit ratio.
DFSTrace is not very new. We chose it because it contains the program and user information, which is absolutely necessary for the PULS model. In the future, we would like to collect our own traces that PULS can use, and examine how PULS performs under more recent traces.

Supported in part by the National Science Foundation award number PO-10152754.

Figure 1. Program-based Last Successor model

Figure 2. Program- and User-based Last Successor model

Figure 6. Incorrect predictions made by LS, PLS, and PULS

Figure 8. No predictions made by LS, PLS, and PULS