Exploiting User Activeness for Data Retention in HPC Systems

HPC systems typically rely on the fixed-lifetime (FLT) data retention strategy, which only considers temporal locality of data accesses to parallel file systems. However, our extensive analysis based on the leadership-class HPC system traces suggests that the FLT approach often fails to capture the dynamics in users' behavior and leads to undesired data purge. In this study, we propose an activeness-based data retention (ActiveDR) solution, which advocates considering the data retention approach from a holistic activeness-based perspective. By evaluating the frequency and impact of users' activities, ActiveDR prioritizes the file purge process for inactive users and rewards active users with extended file lifetime on parallel storage. Our extensive evaluations based on the traces of the prior Titan supercomputer show that, when reaching the same purge target, ActiveDR achieves up to 37% file miss reduction as compared to the current FLT retention methodology.


INTRODUCTION
HPC systems usually offer a large, shared scratch space: a file system that provides high-performance, parallel file access to applications. Although this scratch space is not designed to store data permanently, many users treat it as a normal file system, storing their data files without any plan to release the space voluntarily. On the other hand, as upgrading an HPC storage system usually takes a vast amount of investment, time and effort [7,15,34,47], the total capacity of the scratch space tends to remain fixed for a considerably long time after the deployment of a system. Additionally, HPC applications constantly generate a tremendous amount of data [1,2,9,14,21-23,35], making it necessary to manage storage resources effectively [11,13,18,19,38,48], in particular through "data retention", a process of retaining useful files and purging unimportant files to improve the utilization of storage space.
Over the years, various data retention methodologies have been proposed, with multiple transitions between data retention criteria. Existing methodologies include the fixed-lifetime (FLT) strategy, where files are purged according to a fixed definition of file lifetime; the value-based approach, where files are purged according to various definitions of file value; and the "scratch-as-a-cache" approach, where the scratch space is used as a cache for job executions. However, the value-based approach remains conceptual because the inconsistent definitions of file value among its variants compromise its applicability. The scratch-as-a-cache approach is also problematic: it requires intensive file loading and off-loading at the beginning and the end of a job execution, which can significantly increase the job execution time and complicate the design of the job scheduling algorithm. In fact, to the best of our knowledge, there is almost no sign of the value-based and the scratch-as-a-cache approaches in practice.
Today, the fixed-lifetime (FLT) data retention strategy is still the dominant data retention solution, used in the vast majority of HPC systems, while other approaches are rarely found in real practice. Therefore, we take the FLT approach as the foundation of our discussion in this study. In Table 1, we show several examples of FLT at different facilities. With this strategy, the data retention process can be automated by periodically scanning and purging the stale files that have exceeded their lifetime. As a result, the files that were accessed within the specified file lifetime are retained. The underlying assumption of this retention solution is that the most recently accessed files are more likely to be accessed again in the near future. In other words, FLT data retention only considers the temporal locality of data files in scratch space.
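The fixed-lifetime policy described above can be sketched in a few lines of Python. This is an illustration only: the function name `flt_purge`, the in-memory mapping of paths to last-access times, and the sample paths are assumptions made for this example and do not reflect any facility's actual purge tooling.

```python
from datetime import datetime, timedelta


def flt_purge(last_access, now, lifetime_days=90):
    """Split files into (retained, purged) under a fixed-lifetime policy.

    last_access: dict mapping a file path to its last access time
    (a hypothetical stand-in for a file system scan).
    A file is purged when it has not been accessed within `lifetime_days`.
    """
    lifetime = timedelta(days=lifetime_days)
    retained, purged = {}, []
    for path, atime in last_access.items():
        if now - atime > lifetime:
            purged.append(path)
        else:
            retained[path] = atime
    return retained, sorted(purged)


now = datetime(2016, 6, 1)
files = {
    "/scratch/alice/run1.h5": datetime(2016, 5, 20),    # accessed 12 days ago
    "/scratch/bob/old_input.nc": datetime(2016, 1, 5),  # accessed 148 days ago
}
retained, purged = flt_purge(files, now)
print(purged)
```

Only the last access time enters the decision, which is what makes the policy easy to automate.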
Table 1: Examples of FLT at different facilities.
Facility      Policy
[27]          Purge any 120-day old
OLCF [31]     Purge any 90-day old
TACC [40]     Purge any 30-day old
NERSC [28]    Purge any 12-week old
While it is widely recognized that temporal locality is of great importance in storage management, we observe that scratch space management needs to consider more than temporal locality. In fact, the file access pattern in the scratch space is often influenced by users' behavior. For example, in a project that drives multiple runs of HPC applications, users may not be able to work on their data files continuously, due to factors such as a temporary administrative suspension of the project or task switching in the users' workflow. It is therefore very common for users to leave their data files untouched for quite a long time and then come back to access them. Thus, the FLT data retention strategy often leads to undesired file misses for users. Since many datasets contain a large number of files and each data file can be large as well, the procedures for collecting data files are usually complicated and hard to repeat. Recovering these files is therefore not only expensive but also time-consuming, frustrating and, sometimes, even impossible. In addition, knowing the fixed file lifetime specified by FLT, some users can game the system by "touching" their files periodically [26], so that the lifetime of their files is "renewed" before the fixed-lifetime data retention wipes them out. This trick can lead to underutilized storage space if the users only reserve the files but rarely use them.
In summary, temporal locality alone is insufficient for evaluating the file access pattern in the scratch space. With a fixed file lifetime setting, FLT data retention is often unable to capture the dynamics of users' behavior and hence leads to numerous problems such as undesired file misses, unnecessary hassle from expensive file re-transmission, and underutilized scratch space. In this study, we rethink the data retention problem from a holistic perspective that focuses on evaluating the activeness of users in operating an HPC system (using the system, accessing their files, etc.) and the activeness of users in producing outcomes (i.e., completing jobs and tasks, producing analysis results, publications using a data set, etc.). Based on such a perspective on users' activeness, we propose ActiveDR, an activeness-based data retention strategy that places the activeness of users at the core of its design. ActiveDR has an efficient activeness evaluation algorithm that measures the frequency and the impact of user activities within a specified number of periods and then ranks the activeness of each user during these periods. ActiveDR categorizes users by their activeness and purges files accordingly. It rewards active users with extended file lifetime based on their activeness rank. The retrospective file purging mechanism of ActiveDR ensures that the specified purge target is reached while prioritizing the purge of inactive users' files. Overall, with a user-centric view, ActiveDR is a unique and effective data retention solution that, to the best of our knowledge, promotes active and fruitful use of the HPC system.
We have evaluated the efficacy of ActiveDR using two years of system traces from the Titan supercomputer and its Spider II storage system. Our evaluation results demonstrate that, when reaching the same purge target, ActiveDR reduces file misses by up to 37% by retaining up to 213.47% more data for active users, as compared to the fixed-lifetime data retention method. Also, with ActiveDR, up to 95% of active users are exempt from file misses caused by the data purge operations. Furthermore, ActiveDR has a memory footprint of less than 500 MB when evaluating the system traces, and its activeness evaluation process finishes within one second. Overall, ActiveDR takes about one hour to finish the entire data retention process for over 935 million files. Our prototype release of ActiveDR can be found at https://doi.org/10.5281/zenodo.5168853.
The rest of the paper is organized as follows. In Section 2, we introduce state-of-the-art data retention strategies and discuss their drawbacks in detail. We then discuss the design principle of ActiveDR and detail its design in Section 3. After presenting the experimental results of our evaluation in Section 4, we conclude our work in Section 6.

RELATED WORK
In numerous studies on storage resource management [38,39], data retention was performed by defining a fixed lifetime for files and then monitoring the file access time. This fixed-lifetime (FLT) data retention strategy relies only on the temporal properties of the files. Backed by the temporal locality theory, the FLT approach is widely accepted, as it is believed that a data file will not be accessed again if it has not been accessed for a long time. In addition, the FLT approach is simple and easy to implement; hence FLT is used in most HPC systems.
Several studies proposed data value-based approaches [43,48], which include more file attributes in the data retention criteria, such as the file type, file size, file age, file access frequency or a combination of them. The value-based approaches then led to a series of studies on finding the true definition of file value [4-6, 8, 10, 17, 25, 37, 39, 41, 42, 45, 49]. However, as discussed by Attard et al. [3], "there is no consensus on the definition of data value", and the methodology of assessing or quantifying the value of data is currently incomplete. This drawback limits the interoperability of value-based approaches, as their data value specifications remain different. Consequently, value-based approaches may introduce additional complexity in finding the most appropriate definition of file value for a particular HPC storage system, and hence their practicality is substantially compromised. Therefore, we exclude the value-based approach from the following discussion, not only because it is impractical but also because picking any one of its variants for further discussion would not be objective.
Monti et al. [26] proposed a "scratch-as-a-cache" data retention solution where an HPC scratch space is considered as a cache for jobs running on HPC systems. In this solution, a data file can only stay in a given scratch space if an application is using it. While this solution may be helpful in restoring the scratch space in a timely manner, it may cause frequent loading of files from an archive and frequent purging operations, both of which are time-consuming. Frequent loading and off-loading of large files can also impose a heavy I/O burden on the storage system. Hence we exclude this approach from the following discussion as well, since it may significantly lengthen the execution time of an application (or even the entire workflow) and may introduce unnecessary performance challenges to the storage system.
While both the value-based approach and the scratch-as-a-cache approach are rarely used in real practice due to their limited practicality, the most widely used data retention solution in the majority of HPC storage systems today is still the fixed-lifetime (FLT) data retention methodology. Essentially, the FLT method retains the files that are accessed within a specified amount of time called the "file lifetime". The underlying assumption of FLT is that the most recently accessed files will be accessed again in the near future (or within the file lifetime, to be specific).
However, users may not access files within the specified lifetime that the FLT data retention methodology expects. Therefore, FLT is often unable to capture the dynamics in users' behavior and hence results in undesired file removal. In some cases, the users may process data files through multiple iterations and need to access different sets of files back and forth. While some data processing iterations may last for months and only involve a particular set of the files, other files that are useful to future data processing iterations may be purged by the FLT retention process. Thus, the users may have to reload or restore those files in order to proceed with further data processing iterations. In some other cases, the users may be temporarily distracted from their data processing tasks due to unanticipated interruptions. For example, some users may find it necessary to temporarily hold their project, to conduct additional field studies or to collect additional data right after storing some data in scratch space. If the field study or the additional data collection process takes longer than the specified file lifetime, the FLT approach will purge the data files that were previously loaded, causing file misses for the users when they access their data files.
To verify how frequently and significantly the FLT method may introduce file misses to users, by courtesy of the Oak Ridge Leadership Computing Facility (OLCF), we ran an emulation on the job and system traces collected during 2015 and 2016 at OLCF. We formulated a virtual file system by collecting the paths of all accessed files from the command lines in the job submission traces. We emulated the file accesses in 2016 while applying the FLT method with a 90-day file lifetime and a 7-day purge trigger interval. From the result plotted in Figure 1, we can see that, during the 366 days in 2016, the file miss ratio fluctuates randomly around 5%, between the lowest 0% and the highest 95.66%. For over 120 days, the file miss ratio is between 1% and 5%, and the file miss ratio runs between 5% and 30% for 99 days. Although the file miss ratio exceeds 30% for only 39 days, the result shows that the users may intermittently suffer from 5%-100% daily data access interruption during 138 days, more than a third of the entire year. As there is no mechanism for users to recover their missing files automatically, it can take hours to days for the users to recover their data by either re-transmission or re-generation of the data, which causes a significant amount of network traffic, computing cycles and even project delay. In fact, the scratch space is typically built to serve the short-term high-performance parallel accesses from batch jobs [32]. If the users need to keep their files for a long-term data processing task, they often need to manually manage their data files, migrating them to archival storage and loading them back to scratch space when needed, which is time-consuming and inconvenient. According to the observation reported in a prior study [26], some users may even game the FLT by "touching" their data files periodically to avoid undesired purging of temporarily unused data files. Such practice can lead to underutilized storage space.
We observe that the activeness of different users may vary significantly, and the FLT data retention methodology ignores such variations. For example, some users working on data analytics workloads require an increasing amount of scratch space from time to time, while other users may only access their data once in a while. Therefore, it is time to devise a novel data retention solution which avoids undesired file purge as much as possible for active users, boosts the overall utilization of the storage space, and encourages fruitful usage of HPC systems.

ACTIVENESS-BASED DATA RETENTION
Being aware of the limitations we observed in the FLT data retention method, we envision that a better data retention solution should consider file availability for active users as well as the overall utilization of HPC systems towards fruitful outcomes. Therefore, the activeness of users should be well considered in the data retention solution. In addition, such a data retention solution should be able to integrate with current data management practices in an automated fashion and should not require extensive tuning or training efforts from system administrators. Meanwhile, the solution should be efficient so that the purge decision can be made in a timely manner without taking a significant amount of memory. In this study, we introduce a novel activeness-based data retention solution, or ActiveDR for short, to address the limitations of FLT and to meet data retention needs in HPC systems.
Different from the file-centric FLT data retention method, ActiveDR considers the activeness of users at the core of its design. As shown in Figure 2, ActiveDR first evaluates the activeness rank of users by evaluating the operation activeness and the outcome activeness. Then it classifies all users into four classes, i.e. "Both Active", "Operation Active Only", "Outcome Active Only" and "Both Inactive". By scanning the user directories in ascending order of user activeness, ActiveDR prioritizes active users over inactive users in retaining their files. In other words, ActiveDR cuts back the file lifetime of inactive users and rewards active users with more file lifetime using the activeness rank. The ActiveDR solution is designed to be automated: administrators only need to perform an initial setup, and the remaining procedures run automatically. Additionally, ActiveDR supports the commonly needed purge exemption feature, which allows the administrator to specify a list of files to be retained, so that ActiveDR skips over these files.
Figure 2: Overview of ActiveDR design. ActiveDR first evaluates the activeness rank of users and classifies them into four classes, i.e. Both Active, Operation Active Only, Outcome Active Only, and Both Inactive. Then it scans user directories in ascending order of user activeness and prioritizes active users over inactive users in retaining files, i.e. adjusts the file lifetime of inactive users and rewards active users with more file lifetime using the activeness rank. ActiveDR supports the purge exemption feature.
When designing ActiveDR, we consider that system administrators need to focus on daily operations and occasional maintenance and hence should spend minimal effort on tuning system management software. Therefore, in our design, we do not attempt to predict users' future activeness or future file access patterns, because users' activity is hard, if not impossible, to predict precisely. Although there are many studies on predicting user behaviors using machine learning (ML) methods [12,36,46], they all require a complicated training process, which makes these methods expensive for a rapid evaluation of users' activeness. Additionally, tuning the ML models can be a challenging task for system administrators. Furthermore, the results of many ML approaches are not as intuitively explainable as system administrators need them to be.

Activeness-based Perspective
With the goal of capturing the mutual impact between users' activities and file accesses, we consider the data retention problem from a novel activeness-based perspective where users' activities are categorized into two dimensions: operations and outcomes. Our definition of an operation applies to a wide range of user activities performed on the system, as shown in Table 2. These operations reflect the activeness of users and thus change the priorities in the data retention process. Likewise, an outcome refers to an accomplishment users have achieved by using the HPC system, or, in other words, what the users produce or generate after performing the operations on the system (examples are also shown in Table 2). The consideration of both operation and outcome activities is the hallmark of the activeness-based perspective. It ensures the fairness of user activeness evaluation and prevents the "periodic-file-touch" tricks. Since many HPC facilities do keep track of (or at least are considering tracking) user operations and outcomes [16,20,24,29,33,44], it is a fair assumption and a feasible approach to include operations and outcomes in the data retention solution.

Table 2, Outcomes column:
Successful completion of a job
Successful completion of a task in a workflow
Dataset generated from a job execution
Publications resulting from a job output
...
The advantage of the activeness-based perspective is its inclusiveness for a diverse spectrum of users' activities, with a particular focus on the HPC system and its storage space, which allows a flexible choice of user activities for user activeness evaluation. In turn, this approach makes it possible to provide a holistic consideration of different factors including temporal locality, spatial locality, users' behavior and system utilization. For example, by capturing data transfer or data sharing activities among users, the shared use of data files is considered. By capturing file access activities, the temporal locality is considered. By capturing activities that access users' directories, the spatial locality is considered. Next, we discuss the model of activeness evaluation.

User Activeness Evaluation
To provide a simple and effective solution, the user activeness evaluation algorithm is designed to unify the activeness measurement of different user activities. For any type of activity, the algorithm only needs two essential measures: the time and the impact of the activity. For a specific type of operation, the time and the impact can be concrete metrics. As an example, for a job submission activity (operation), the time can be the job submission time or the job start time, and the impact can be the total run time or the CPU hours. Similarly, for an outcome activity such as a publication, the time can be the time of the publication and the impact can be the citation count of the publication. With such a unified activeness measurement model, we are able to quantify and calculate the activeness of users based on their activities. In other words, operations and outcomes can be configured by system administrators based on what they keep track of, with weights that quantitatively measure the impact. Please note that such a setup is only a one-time configuration. ActiveDR uses this information to calculate the activeness and make the data retention decision, which provides finer control over purging files than a decision based solely on timestamps.
We now introduce how to calculate the activeness of a user. As shown in Table 3, suppose a user may have $n$ types of activities; we denote the set of all activity types as $A = \{a_0, ..., a_{n-1}\}$. For a certain activity type $a_i$, we consider that there is a set of activities $E_i = \{e_0, ..., e_{m-1}\}$. For an activity $e_j$, we consider its activeness to be $v_j$. Since the activeness of each activity is measured by its impact, the value of $v_j$ is a specific, predefined value configured by system administrators. We consider that all activities of type $a_i$ are distributed among a set of periods, and each period contains $d$ days. While the period length $d$ is a configurable parameter, for the activity set $E_i = \{e_0, ..., e_{m-1}\}$ of type $a_i$, the total number of periods $P$ can be calculated as:
$$P = \frac{t_{e_{m-1}} - t_{e_0}}{\mathrm{to\_unit}(d)} \qquad (1)$$
where $t_{e_j}$ is the timestamp of activity $e_j$ and the function $\mathrm{to\_unit}$ converts the period length into the same unit as the activity timestamp. We further calculate the average activeness of all activities of type $a_i$ across all periods:
$$\bar{v} = \frac{1}{P} \sum_{j=0}^{m-1} v_j \qquad (2)$$
Afterwards, we calculate the activeness of all activities in $E_i$ during each period $p$. For a period $p$, let $V_p = \sum_{j=0}^{k-1} v_j$ be the overall activeness of all activities $\{e_j \mid j \in [0, k)\}$ of type $a_i$ that occurred in this period; we calculate the activeness ratio of the activities of type $a_i$ in this period:
$$r_p = \frac{V_p}{\bar{v}} \qquad (3)$$
When the activeness ratio $r_p \ge 1$, we consider that the user is active on type $a_i$ activities during period $p$. When the activeness ratio $r_p < 1$, we consider that the user is inactive on type $a_i$ activities during period $p$.
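The per-period activeness ratio described above (total activeness in a period divided by the average activeness per period) can be illustrated with a minimal Python sketch. The function name `activeness_ratios`, the `(timestamp, impact)` tuple layout, and the rounding of the number of periods (rounded up, at least one) are assumptions made for this example rather than details taken from the ActiveDR implementation.

```python
import math


def activeness_ratios(activities, period_days):
    """Return the per-period activeness ratio for one activity type.

    activities: list of (timestamp_in_days, impact) tuples, in any order.
    period_days: configurable period length in days.
    Assumption: the number of periods is the activity time span divided by
    the period length, rounded up, and at least 1.
    """
    if not activities:
        return []
    times = [t for t, _ in activities]
    start, span = min(times), max(times) - min(times)
    num_periods = max(1, math.ceil(span / period_days))

    # Total activeness (impact) that falls into each period.
    per_period = [0.0] * num_periods
    for t, impact in activities:
        idx = min(int((t - start) // period_days), num_periods - 1)
        per_period[idx] += impact

    # Average activeness across all periods, then the ratio per period.
    average = sum(per_period) / num_periods
    if average == 0:
        return [0.0] * num_periods
    return [v / average for v in per_period]


# Job submissions as (day, core hours): a busy first month, then a quiet one.
jobs = [(0, 300.0), (10, 300.0), (45, 200.0), (60, 0.0)]
ratios = activeness_ratios(jobs, period_days=30)
print(ratios)  # a ratio >= 1 marks an active period, < 1 an inactive one
```

In this example the first period carries more than the average activeness and the second carries less, so the user would count as active in the first period and inactive in the second.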
Let $t_c$ be the current time when the user activeness evaluation begins. For an activity that occurred during a given period, we calculate the index of that period relative to $t_c$ (Equation (4)).
Figure 3: Time-series activeness rank vector when the number of periods is 5.
Then, a time-series activeness rank vector is built as shown in Figure 3. The length of the vector is equal to the total number of periods that the specified type of activities spans. Each element in the vector represents the activeness rank of the corresponding period. ActiveDR is designed to value those users who have remained active recently. Therefore, at time $t_c$, we expect the activeness rank acquired from a more recent period to have a larger impact on the overall activeness rank. As such, the activeness rank of each period is derived from its activeness ratio through an exponential term: the closer the period is to the current time $t_c$, the larger this term becomes, and hence the more the activeness ratio in that period contributes to the overall activeness rank. This feature is guaranteed by the monotonic property of the exponential function.
Finally, after the activeness rank vector is derived, the overall activeness rank $\Phi$ of a particular activity type is calculated from the elements of the vector (Equation (5)). With this equation, ActiveDR guarantees that the activeness rank $\Phi$ is either in the range [0, 1) or in the range [1, +∞). We consider that when $0 \le \Phi < 1$, the user is inactive for the activities of that type, and when $\Phi \ge 1$, the user is active for the activities of that type. In addition, the larger the value of $\Phi$ is, the more active the user is for this type of activity, and vice versa.
In ActiveDR, we consider two classes of user activities: operations and outcomes. The activeness ranks of all operation activity types are combined into the overall operation activeness rank $\Phi_{op}$, and the activeness ranks of all outcome activity types are combined into the overall outcome activeness rank $\Phi_{oc}$ (Equation (6)). It is noteworthy that both $\Phi_{op}$ and $\Phi_{oc}$ will be within either the range [0, 1) or the range [1, +∞), since the activeness rank of each activity type in these categories is within either of these two ranges as well.
Please note that, in the ActiveDR design, we have considered the case where outcomes may take longer to materialize. That is why the user activeness evaluation model is based on consecutive periods instead of just one period (please see Equations (2) - (6) in this section). Additionally, the activeness ratio of each activity type is calculated relative to an average value in each period, as shown in Equation (3). Therefore, long jobs are not penalized for their long run time.

User Classification
Based on the evaluated user activeness, ActiveDR classifies all users into four categories, as depicted in Figure 4. In each category, users are sorted according to their activeness rank. The data retention procedure scans users' directories based on these four user activeness categories. In ActiveDR, the operation activeness rank $\Phi_{op}$ is given higher priority. Therefore, users are sorted according to their operation activeness first and are then further differentiated according to their outcome activeness.
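The four-way classification and the resulting scan order can be sketched as follows. This is a hypothetical illustration: the class labels, the `classify` and `scan_order` helpers, and the tie-breaking by the pair (operation rank, outcome rank) are assumptions chosen to mirror the description above, with a rank of at least 1 treated as active.

```python
def classify(op_rank, oc_rank):
    """Map a user's operation/outcome activeness ranks to one of four classes.

    A rank >= 1 is treated as active, a rank < 1 as inactive.
    """
    if op_rank >= 1 and oc_rank >= 1:
        return "both_active"
    if op_rank >= 1:
        return "operation_active_only"
    if oc_rank >= 1:
        return "outcome_active_only"
    return "both_inactive"


# Classes from least to most active; operation activeness takes priority.
CLASS_ORDER = ["both_inactive", "outcome_active_only",
               "operation_active_only", "both_active"]


def scan_order(users):
    """users: dict user -> (op_rank, oc_rank). Returns users in scan order."""
    return sorted(
        users,
        key=lambda u: (CLASS_ORDER.index(classify(*users[u])), users[u]),
    )


users = {"ann": (2.0, 1.5), "bob": (0.2, 0.1),
         "cat": (0.5, 3.0), "dan": (1.2, 0.4)}
order = scan_order(users)
print(order)
```

Users scanned first are the first to have their files considered for purging, so the least active users appear at the front of the list.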

Data Retention
To run the data retention procedure, the administrator needs to provide an initial file lifetime for new users and both-inactive users, so that the files of these users follow the initial file lifetime setting and are not purged when they are scanned for the first time. The administrator also needs to provide a purge target indicating the space utilization that should be reached.
Optionally, the administrator can specify a list of reserved files for the purpose of file purge exemption. ActiveDR reads the file reservation list and stores the paths of the reserved files in a compact prefix tree. When scanning the files of each user, ActiveDR can efficiently determine whether the path of an encountered file is in the file reservation list and skip over the reserved files in the retention procedure. Please note that we consider the file reservation list as a contract between users and the system administrator. The paths of the files on the reservation list are not supposed to change. If a user changes the path of a previously reserved file without notifying the system administrator, we consider that the user has cancelled the reservation of that file.
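The reservation lookup can be illustrated with a small path-based prefix tree. The sketch below is a plain, uncompressed trie keyed by path components, not the compact prefix tree used by ActiveDR; the class name `PathTrie` and the sample paths are assumptions for this example.

```python
class PathTrie:
    """Prefix tree keyed by path components (a simple, uncompressed variant)."""

    def __init__(self):
        self.root = {}

    def insert(self, path):
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
        node[None] = True  # end-of-path marker

    def contains(self, path):
        node = self.root
        for part in path.strip("/").split("/"):
            if part not in node:
                return False
            node = node[part]
        return None in node


reserved = PathTrie()
reserved.insert("/scratch/alice/keep.dat")

for candidate in ["/scratch/alice/keep.dat", "/scratch/alice/other.dat"]:
    action = "skip (reserved)" if reserved.contains(candidate) else "evaluate"
    print(candidate, "->", action)
```

A lookup costs one dictionary access per path component, independent of how many files are reserved. A compact (radix) variant would additionally merge chains of single-child nodes to save memory.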
Different from the FLT data retention solution, where the data files are scanned in the order specified by the system, ActiveDR scans users' directories in ascending order of the users' activeness rank. First, ActiveDR evaluates users in the both-inactive and outcome-active-only categories. Afterwards, ActiveDR visits the other two categories, i.e. operation-active-only and both-active, in ascending order of the outcome activeness.
When visiting users in a certain activeness category, ActiveDR scans each file in the user's directory. For each file that is not reserved, ActiveDR adjusts the lifetime of the file by multiplying it by the activeness rank of the user (shown as "file lifetime adjustment" in Figure 2). The more active the user is, the higher the chance that his/her files will survive the purge. Suppose the initial file lifetime is $l_0$ days; the adjusted file lifetime $l_f$ of a file $f$ owned by a user $u$ is calculated by the following equation:
$$l_f = l_0 \cdot \Phi_u$$
where $\Phi_u$ is the activeness rank of user $u$. At time $t_c$, ActiveDR examines the access time $t_f$ of file $f$ and purges the file if $t_c - t_f > l_f$. As soon as the purge target is reached, ActiveDR stops the data retention procedure. To ensure that active users are protected from file purge to the maximum degree, each time it finishes scanning an activeness group, ActiveDR tests whether the purge target has been reached. If not, ActiveDR retrospectively works on that activeness group for a specified number of times (currently five times in our implementation), decreasing the user activeness rank by a predefined percentage each time (currently 20% in our implementation). If the purge target is still not reached after all activeness groups have been tried, ActiveDR stops and reports to the administrator via a specified reporting mechanism.
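The lifetime adjustment and the retrospective passes over one activeness group can be sketched as below. This is a simplified illustration under stated assumptions: the function name `purge_group`, the in-memory `files` mapping, time measured in days, and the interpretation of "decrease by 20%" as multiplying the rank by 0.8 on each extra pass are all hypothetical choices, not details confirmed by the text.

```python
def purge_group(files, ranks, now, initial_lifetime, bytes_to_free,
                reserved=frozenset(), max_retries=5, decay=0.20):
    """Purge one activeness group until `bytes_to_free` is reclaimed.

    files: dict path -> (owner, last_access_day, size_bytes); mutated in place.
    ranks: dict owner -> activeness rank (unknown owners default to 1.0).
    Returns the number of bytes actually freed.
    """
    freed = 0
    scale = 1.0
    for _ in range(1 + max_retries):   # first pass plus retrospective passes
        for path in sorted(files):
            if freed >= bytes_to_free:
                return freed           # purge target reached: stop at once
            if path in reserved:
                continue               # purge exemption
            owner, atime, size = files[path]
            # Adjusted lifetime = initial lifetime x (scaled) activeness rank.
            lifetime = initial_lifetime * ranks.get(owner, 1.0) * scale
            if now - atime > lifetime:
                del files[path]
                freed += size
        if freed >= bytes_to_free:
            break
        scale *= (1.0 - decay)         # assumed multiplicative 20% decrease
    return freed


files = {
    "/s/u1/a": ("u1", 0, 100),     # idle for 200 days
    "/s/u1/b": ("u1", 170, 100),   # idle for 30 days
    "/s/u2/c": ("u2", 100, 100),   # idle for 100 days, very active owner
}
ranks = {"u1": 0.5, "u2": 2.0}
freed = purge_group(files, ranks, now=200, initial_lifetime=90,
                    bytes_to_free=200)
print(freed, sorted(files))
```

In this example the first pass removes only the long-idle file of the inactive user. The target is met on a later pass, once the reduced rank shortens that user's adjusted lifetime below 30 days, while the active user's file is never touched.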
In the current design, the lifetime of a file is only extended by its owner's activeness rank, even though the file may be shared with other users. We keep such a design because we consider that the owner of a file should be responsible for it when sharing it. We also consider that such a design principle helps limit the complexity of the solution.
When ActiveDR is used for a new system, or when new user accounts are created in HPC systems, it is very likely that no activity information is available for all or some users. To avoid the files of these users being purged immediately after the first run of ActiveDR, we set the initial user activeness rank of all activity types to 1.0. This handling ensures that new users' files are provided with the initial file lifetime and will not be purged during the first data retention process.

EVALUATION
In this section, we first introduce our dataset, experimental platform, and evaluation procedure, and then present our experimental results, including the data retention results and the performance of ActiveDR.
[...] to 2016 by the users at OLCF (the publication list is provided by OLCF too). As the file size is not directly available and we can only get the number of stripes from the metadata snapshot, we generate a synthesized file size for each file in the snapshot according to the best striping practice of the Spider file system suggested by [30].

Experimental Platform. We conducted our emulation-based evaluation on the Cori supercomputer hosted at the National Energy Research Scientific Computing Center (NERSC). Specifically, we used the Haswell compute nodes for our experiments. Each Cori Haswell compute node has two 16-core Intel® Xeon™ processors E5-2698 v3 ("Haswell") at 2.3 GHz and 128 GB of DDR4 2133 MHz memory. The peak performance of each compute node is 1.2 TFlops. The compute nodes use GPFS for their home directories and multiple Lustre file systems as scratch spaces. We used a 30 PB Lustre file system with over 700 GB/s peak I/O bandwidth for our evaluation. Our ActiveDR implementation is written in Python and uses the mpi4py package to enable parallel emulation. Other Python packages we used include pandas and numpy.

Evaluation Procedure.
In our evaluation, we designed an emulation-based experiment to verify the effectiveness and efficiency of ActiveDR in comparison with the FLT data retention method. Throughout the entire experiment, we set the purge target to be 50% of the total storage capacity (the total synthesized size of all files in the last weekly metadata snapshot of 2015). Also, we used the job scheduler logs as the input for operation activities, and we used the research publication list as the source of outcome activities. In particular, for each job, we use the core hours (the number of CPU cores multiplied by the job duration) as the activeness score. We derive the activeness score of each publication by multiplying its adjusted citation count by the rank of the user in the author list. These quantities are derived from the actual citation count, the total number of authors of the publication, and the index of the user in the author list. Our evaluations were conducted using these two activity traces in the hope of showing the effectiveness of ActiveDR in a situation where the outcome activities are not directly related to the operation activities, but please note that ActiveDR can work with different types of activities, as discussed in Section 3.1.
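As a small worked example of the job activeness score, core hours can be accumulated per user from scheduler records. The record layout `(user, num_cores, duration_hours)` and the function name `core_hours_by_user` are assumptions for this sketch; real scheduler logs would need to be parsed into this form first.

```python
from collections import defaultdict


def core_hours_by_user(jobs):
    """Sum core hours (cores x duration in hours) per user.

    jobs: iterable of (user, num_cores, duration_hours) records.
    """
    totals = defaultdict(float)
    for user, cores, hours in jobs:
        totals[user] += cores * hours
    return dict(totals)


jobs = [("ann", 16, 2.0), ("ann", 32, 0.5), ("bob", 8, 1.0)]
scores = core_hours_by_user(jobs)
print(scores)
```

Here the first user accumulates 16 x 2.0 + 32 x 0.5 core hours across two jobs, and the second user 8 x 1.0 from a single job.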
To initialize our experiment, we first load the last weekly metadata snapshot of 2015, extract the file paths, and index them into a compact prefix tree along with the synthesized file size information. The compact prefix tree serves as a virtual file system in our emulation. It allows us to test whether a given file path matches an existing file and to efficiently retrieve the size of each file from its path. We then replay the application logs of 2016 and emulate the file accesses and data retention processes. We run the ActiveDR and FLT solutions with a 90-day file lifetime and a 7-day purge trigger interval on the weekly metadata snapshots. These settings were previously used in the Spider II storage system at OLCF, and we reuse them to reproduce the real data retention situation as closely as possible.
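The role of the prefix tree as a virtual file system can be sketched as follows. This is a minimal, uncompacted trie keyed by path components, offered as a simplification of the compact prefix tree named above; the class and method names are hypothetical.

```python
class PathTrie:
    """Uncompacted prefix tree over path components, storing file sizes."""

    # Reserved key for the size entry; a NUL byte cannot occur in a
    # POSIX path component, so it never collides with a real name.
    _SIZE = "\0size"

    def __init__(self):
        self._root = {}

    @staticmethod
    def _parts(path):
        return [p for p in path.split("/") if p]

    def insert(self, path, size):
        node = self._root
        for part in self._parts(path):
            node = node.setdefault(part, {})
        node[self._SIZE] = size

    def _find(self, path):
        node = self._root
        for part in self._parts(path):
            node = node.get(part)
            if node is None:
                return None
        return node

    def size_of(self, path):
        """Return the stored size, or None if the path is not a file."""
        node = self._find(path)
        return None if node is None else node.get(self._SIZE)

    def __contains__(self, path):
        return self.size_of(path) is not None

    def remove(self, path):
        """Purge a file; return True if it existed."""
        node = self._find(path)
        if node is None or self._SIZE not in node:
            return False
        del node[self._SIZE]
        return True


vfs = PathTrie()
vfs.insert("/lustre/proj/run1/out.dat", 1024)
```

A lookup such as `"/lustre/proj/run1/out.dat" in vfs` then answers the existence test, and `vfs.size_of(...)` returns the synthesized size. A compacted variant would merge single-child chains to save memory.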
During the experiments, each time the ActiveDR data retention is triggered, we first run a preparation procedure that loads the corresponding weekly metadata snapshot and the activity traces, then evaluates the user activeness and stores the result in memory. Each time our emulator encounters a file path while replaying the application logs, we first test whether the file is indexed in the compact prefix tree. If not, we count a file miss; otherwise, we follow the data retention procedure in our design to remove the file. For FLT data retention, we replay the logs and purge the files as in the logs. There is no preparation procedure for FLT.
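The replay loop can be pictured with the following sketch, which counts file misses per day while a purge routine fires at a fixed trigger interval. Everything here is an illustrative assumption: the virtual file system is a plain dict mapping each path to its last access day, a hit renews the access time, a missed file is not re-created, and `flt_purge` is a bare fixed-lifetime rule rather than the emulator's actual code.

```python
def flt_purge(files, today, lifetime_days=90):
    """Fixed-lifetime rule: drop files not accessed within the lifetime."""
    expired = [p for p, last in files.items() if today - last > lifetime_days]
    for path in expired:
        del files[path]


def replay(accesses, files, purge, trigger_days=7):
    """Replay (day, path) accesses sorted by day; return misses per day.

    files maps path -> last access day and is modified in place.
    purge(files, today) is called once per trigger interval.
    """
    misses_per_day = {}
    next_purge = trigger_days
    for day, path in accesses:
        # Run every purge that was due before this access.
        while day >= next_purge:
            purge(files, next_purge)
            next_purge += trigger_days
        if path in files:
            files[path] = day  # assumed: an access renews the file
        else:
            misses_per_day[day] = misses_per_day.get(day, 0) + 1
    return misses_per_day


# Days are counted from the start of the replay; negative values are
# last-access days taken from the initial snapshot.
files = {"/a": -80, "/b": -10}
misses = replay([(3, "/b"), (20, "/a"), (21, "/a")], files, flt_purge)
```

In this toy run, `/a` passes the 90-day lifetime at the purge on day 14, so both later accesses to it are counted as misses.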
Since the Spider metadata snapshot is already the result of 90-day FLT data retention, we also tested both data retention solutions with 7-day, 30-day, and 60-day file lifetimes, which are shorter than the 90-day file lifetime. This evaluation allows us to observe how both solutions perform with different file lifetime configurations. In addition, we still include the result with the 90-day file lifetime configuration in order to understand what percentage of file misses can be avoided by using ActiveDR instead of FLT.
By utilizing the mpi4py package, we use multiple processes that work together to scan the metadata snapshot. Each process maintains a series of counters to record the number of purged/retained files, the total size of the purged/retained files, the number of users whose files are purged/retained, and so on. Meanwhile, we set multiple probes to monitor the running time and the memory consumption of the program.
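The per-process counters and their final aggregation might look like the sketch below. To keep it runnable without MPI, the ranks are emulated with a plain loop; with mpi4py, each rank would scan only its own share and the merge would be performed on the results collected by a collective such as `comm.gather`. The record layout, the round-robin split, and the function names are assumptions.

```python
from collections import Counter


def scan_partition(records, should_purge):
    """Count purged/retained files, bytes, and users for one rank's share."""
    counts = Counter()
    users = {"purged": set(), "retained": set()}
    for user, size, path in records:
        outcome = "purged" if should_purge(path) else "retained"
        counts[outcome + "_files"] += 1
        counts[outcome + "_bytes"] += size
        users[outcome].add(user)
    return counts, users


def merge(results):
    """Combine per-rank results; user sets are unioned, not summed."""
    total = Counter()
    users = {"purged": set(), "retained": set()}
    for counts, rank_users in results:
        total.update(counts)
        for outcome in users:
            users[outcome] |= rank_users[outcome]
    return total, users


def emulate_ranks(records, n_ranks, should_purge):
    """Split records round-robin across emulated ranks and merge."""
    shares = [records[r::n_ranks] for r in range(n_ranks)]
    return merge(scan_partition(share, should_purge) for share in shares)


records = [
    ("u1", 100, "/s/a.tmp"),
    ("u2", 50, "/s/b.dat"),
    ("u1", 25, "/s/c.tmp"),
    ("u3", 10, "/s/d.dat"),
]
total, users = emulate_ranks(records, 2, lambda p: p.endswith(".tmp"))
```

Keeping user identities as sets matters because the same user can appear in several ranks' shares; summing per-rank user counts would overcount.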

User Activeness
ActiveDR evaluates user activeness before carrying out the data retention procedure. As shown in the activeness matrix in Figure 5, the entire user space can be divided into four categories. Among 13,813 users, only 0.4%-0.9% are in the both-active category. The percentage of operation-active-only users increases slightly from 1.1% to 3.5% as the period length grows from 7 days to 90 days, and the percentage of outcome-active-only users declines slightly from 3.4% to 2.9%. Most users are both-inactive (92.7%-95%). For this vast majority of users, who are inactive in both operations and outcomes, files are considered high-priority candidates for purging. Therefore, when a specific purge target is given, ActiveDR takes advantage of this highly skewed distribution of users among activeness levels and starts the data retention process by purging the files of these inactive users. Compared to FLT, ActiveDR can reach the purge target with more files purged from inactive users, so more files of active users are expected to be retained.

Figure 6: File miss ratio distribution by number of days (FLT vs. ActiveDR)

Figure 6 compares FLT and ActiveDR in terms of the file miss ratio distribution over the 366 days of 2016. As can be seen from the figure, with ActiveDR, the number of days with 1%-5% file misses is reduced by roughly 10% and the number of days with 5%-10% file misses is reduced by almost half. Overall, the number of days with more than 5% file misses is reduced by 31%, from 138 days to 95 days§. This result shows that, on a yearly basis, ActiveDR reduced the total time during which users may randomly suffer from over 5% file misses from more than a third of the year to about a quarter. Except for the file miss ratio ranges of 30%-40% and 60%-70%, we see a reduction of 1 to 4 days in the file miss ratio ranges over 20%. Additionally, ActiveDR reduced the number of days with 50%-60% file misses from 4 to 0. Considering that re-transmitting or re-generating a file upon a file miss can be very expensive, even such a small reduction can be highly beneficial.
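The four-way activeness matrix described in this section can be sketched as a simple classification over two per-user scores. The thresholds below (a user counts as active when the score exceeds zero) are placeholders chosen for illustration, not the criterion used by ActiveDR, and the function names are hypothetical.

```python
from collections import Counter


def classify(op_score, out_score, op_threshold=0.0, out_threshold=0.0):
    """Map (operation score, outcome score) to one of four categories."""
    op_active = op_score > op_threshold      # assumed activeness test
    out_active = out_score > out_threshold   # assumed activeness test
    if op_active and out_active:
        return "both-active"
    if op_active:
        return "operation-active-only"
    if out_active:
        return "outcome-active-only"
    return "both-inactive"


def category_shares(scores):
    """Fraction of users per category; scores maps user -> (op, out)."""
    counts = Counter(classify(op, out) for op, out in scores.values())
    return {cat: n / len(scores) for cat, n in counts.items()}


scores = {"a": (10.0, 2.0), "b": (5.0, 0.0), "c": (0.0, 1.0), "d": (0.0, 0.0)}
shares = category_shares(scores)
```

Applied to the real traces, shares computed this way would correspond to the percentages reported per category for each period length.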

File Miss Reduction
We report the file miss reduction of each user group in Figure 7. Overall, the number of file misses for both FLT and ActiveDR shows an upward trend, with major increases roughly every three months of the year. There are two reasons for this. First, when initializing the virtual file system, we only load the last weekly metadata snapshot of 2015, which is already the result of the OLCF 90-day retention solution. Thus, in our experiment, the number of file misses remains small for the first few months of 2016. Second, as the weekly data retention runs with a 90-day file lifetime setting, more and more files are deleted, which leads to an increasing number of file misses. However, while the file miss numbers of both FLT and ActiveDR increase as more data retention operations are performed, the number of file misses avoided by ActiveDR, i.e., the gap between FLT and ActiveDR, also grows for all four types of users. In general, the result indicates that ActiveDR helps reduce file misses in the long run compared to the FLT solution.

§ This is calculated by summing the number of days in every miss ratio range above 5%.
Figure 8: Statistics on file miss reduction ratio

Figure 8 reports the statistics of the file miss reduction ratio, which is the percentage of file misses eliminated by replacing FLT with ActiveDR. On average, as indicated by the green triangles in the box plot, ActiveDR reduces file misses by 37% for both-active users, 7.5% for operation-active-only users, 11.2% for outcome-active-only users, and 27.5% for both-inactive users, compared to the traditional FLT data retention method. While ActiveDR achieves the largest average file miss reduction ratio for both-active users, it surprisingly achieves the second largest for both-inactive users, with a maximum reduction of 100%, twice as large as that of the both-active category.
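The file miss reduction ratio can be written down directly from its definition. The sketch below computes it for one observation and averages it over a series of daily miss counts; skipping days on which FLT has no misses (where the ratio is undefined) and averaging per day are assumptions of this example, and the function names are hypothetical.

```python
def miss_reduction_ratio(flt_misses, activedr_misses):
    """Share of FLT file misses eliminated by switching to ActiveDR.

    Returns None when FLT has no misses, since the ratio is undefined.
    """
    if flt_misses == 0:
        return None
    return (flt_misses - activedr_misses) / flt_misses


def mean_reduction(daily_pairs):
    """Average the ratio over (flt, activedr) pairs, skipping undefined days."""
    ratios = [
        r for r in (miss_reduction_ratio(f, a) for f, a in daily_pairs)
        if r is not None
    ]
    return sum(ratios) / len(ratios) if ratios else None


# Toy numbers, not taken from the evaluation.
daily_pairs = [(8, 0), (100, 50), (0, 0)]
average = mean_reduction(daily_pairs)
```

A day on which FLT records a few misses and ActiveDR records none yields a ratio of 1.0, which is how a 100% maximum can arise even when the absolute number of misses is small.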

File Miss Reduction Ratio
When examining the details, we found that the number of FLT file misses for both-inactive users remains very small during the first 4-5 months. However, the number of file misses with the ActiveDR solution is even smaller, and is zero on several days during these months. This leads to the 100% file miss reduction ratio and to an overall higher miss reduction ratio than those of outcome-active-only and operation-active-only users. This phenomenon shows the sensitivity of our approach in reflecting users' activeness, because the number of file misses was very low during the first few months of 2016. In our emulation, we started the data retention iterations from the beginning of 2016, and not many recently accessed files were purged at that time.
Additionally, we observe that ActiveDR behaves very similarly to FLT for operation-active-only users, because only operation activities are considered for those users. This observation is also a positive sign of the relevance between operations and file accesses, since we only use the job scheduler logs as the source of operations and no file access traces were used. In comparison, we can see the benefit of considering outcomes for the outcome-active-only users, as this user group experiences noticeably fewer file misses than operation-active-only users, which we attribute to ActiveDR.

Retention with Various Lifetime Settings
We also investigate how ActiveDR and FLT behave differently when the file lifetime is set to 7, 30, 60, and 90 days, respectively. We analyze how much data can be retained by each solution. For this purpose, we examine the retention result on the last weekly metadata snapshot we have (captured on Aug 23rd, 2016).
ActiveDR prioritizes active users and their file availability. Therefore, when a specific data retention target is given, we expect ActiveDR to retain more files for active users and fewer files for inactive users. In our evaluation, we set the purge target to 50% of the total capacity. With this purge target, we run both FLT and ActiveDR to observe the retention result for the users of the various activeness categories.

As shown in Figure 9 and Table 4, for both-inactive users with any period length, ActiveDR retains 13 PB-16 PB less data than FLT. The saved space is about half of the 32 PB total capacity of the Spider file system. In addition, ActiveDR retains up to 213.47% more data for active users, as shown in Table 4 (about 10 TB to 2 PB more files across different period lengths, as shown in Table 5).

It is worth noting that the metadata snapshot we use is already a result of the 90-day FLT data retention at OLCF, and a significant number of files had already been purged from the file system when the metadata snapshot was captured. Therefore, ActiveDR could only evaluate a limited number of the remaining files as candidates for retention under larger period length settings such as 60 days and 90 days. This explains the declining trend of the file retention difference shown in Table 4 and Table 5 as the period length increases. Also, given large period length settings (such as 60-day and 90-day), the retention difference remains relatively insignificant for both-active users and operation-active-only users compared to that of smaller period length settings. This is because almost every job submission can renew the access time of the files that the job accesses, and the file access time is exactly what the 90-day FLT retention solution at OLCF monitors. Therefore, the impact of job activeness and that of file access recency remain similar.
Moreover, the retention difference for the outcome-active-only users given larger period length settings is still remarkably larger than that of both-active and operation-active-only users. This is because the number of outcome-active-only users is either larger than that of both-active users or close to the number of operation-active-only users (as shown in Figure 5). Therefore, it is normal that the total size of files retained for the outcome-active-only users is larger than that of the other two types of users. In fact, this result shows the benefit of considering the "outcome" perspective.
We also compare the total size of files purged by ActiveDR and FLT in Figure 10 and Table 6. Compared to FLT, ActiveDR purges fewer files for all active users and purges more files for both-inactive users for the 7-day and 30-day period lengths. The file purge effect of ActiveDR remains about the same as that of FLT for the 60-day and 90-day period lengths. As the period length grows, we again observe a declining trend in the file purge difference, which we attribute to the fact that the metadata snapshot we use is already a result of the 90-day FLT data retention at OLCF. Also, the file purge differences for all activeness types of users are exactly the same as the file retention differences shown in Table 5. However, in terms of the purge differences for both-inactive users, the numbers are much smaller. We can see that ActiveDR does not lose the ability to purge files for inactive users and actually performs better than FLT.
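To make the idea of reaching a purge target by starting with inactive users concrete, the following is a deliberately simplified greedy stand-in, not the ActiveDR procedure itself. It assumes that only files older than the lifetime are candidates, that categories are visited in the order given by `PURGE_ORDER` (the relative order of the two single-active categories is a guess), and that older files go first within a category. All names are hypothetical.

```python
# Assumed purge priority: least active users first.
PURGE_ORDER = [
    "both-inactive",
    "outcome-active-only",
    "operation-active-only",
    "both-active",
]


def purge_to_target(files, category_of, target_bytes, lifetime_days=90):
    """Greedily pick expired files until target_bytes are freed.

    files: list of dicts with keys path, user, size, age_days.
    category_of: maps user -> activeness category.
    Returns (purged paths in order, bytes freed).
    """
    rank = {cat: i for i, cat in enumerate(PURGE_ORDER)}
    candidates = [f for f in files if f["age_days"] > lifetime_days]
    # Least active category first, then oldest file first.
    candidates.sort(
        key=lambda f: (rank[category_of[f["user"]]], -f["age_days"])
    )
    purged, freed = [], 0
    for f in candidates:
        if freed >= target_bytes:
            break
        purged.append(f["path"])
        freed += f["size"]
    return purged, freed


files = [
    {"path": "a", "user": "u1", "size": 10, "age_days": 100},
    {"path": "b", "user": "u2", "size": 10, "age_days": 95},
    {"path": "c", "user": "u2", "size": 5, "age_days": 200},
    {"path": "d", "user": "u3", "size": 10, "age_days": 10},
]
category_of = {
    "u1": "both-active",
    "u2": "both-inactive",
    "u3": "outcome-active-only",
}
purged, freed = purge_to_target(files, category_of, target_bytes=15)
```

In this toy case the target is met entirely from the both-inactive user's files, so the both-active user's expired file `a` is retained, which mirrors the retention difference discussed above.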

Number of Users
Figure 11: Number of users affected by file purge

As shown in Figure 11, with ActiveDR, the number of users affected by data purge actions in all three active user groups is much smaller than with the FLT approach. Specifically, the number of both-active users affected by file purge actions is less than 60, while the number for the FLT approach is over 700 when the period length is 7 days. This result shows that ActiveDR can protect active users from data loss caused by file purge operations.

Performance Evaluation
Because ActiveDR is a data retention solution intended for use in real systems, we expect it to be efficient in terms of both time and space complexity. We report the performance evaluation results in Figure 12. From Figure 12a, we can see that ActiveDR consumed only 48.85 MB of memory for the user list, 3.5 MB for the publication list, and 419.77 MB for the job traces. The total time for loading these traces is only 1 minute and 35 seconds. Figure 12b shows that, when executing in parallel mode, the main process takes 700 ms for activeness evaluation, while the other processes take only a few microseconds. All processes cumulatively take 1 to 5 seconds to make purge decisions for all 1,040,886 files recorded in the application log. Since the file access pattern is no longer a necessary consideration when ActiveDR evaluates user activeness, we can avoid loading gigabytes of metadata snapshots in practice. Instead, we load the job activity trace, which only accounts for hundreds of megabytes. This further ensures that user activeness evaluation and purge decisions are fast. When testing the purge effect with different period length settings on a single metadata snapshot, it took about 1 hour to scan the entire metadata snapshot with multiple parallel processes, as shown in Figure 12c. As the metadata snapshot is stored as a series of gzipped text files, each process took about 50 to 400 seconds to scan each file, as shown in Figure 12d.

DISCUSSION
This research study aims to provide a novel data retention strategy that values the data accessibility of active users and promotes fruitful use of HPC systems. In our evaluation, we selected job submissions as operation activities and publications as outcome activities. We made this choice for two reasons. First, we wanted to select a type of operation activity that users perform in the HPC system but that does not have to be directly relevant to any file properties. We also wanted to avoid an outcome activity, such as job completion or data generation, that can be easily captured inside the HPC system; rather, we wanted a type of outcome activity that users perform outside the purview of the HPC system. In other words, we would like to show how diverse the user activity types can be while still being practical. Second, the dataset we have allows us to explore such an interesting combination of operation activities and outcome activities, but it also prevents us from exploring other types of operation or outcome activities, such as data transfer or data generation.
However, it is noteworthy that system administrators can choose any type of operation activity and any type of outcome activity that is appropriate for their own system settings. There is no limitation on the type of activities as long as the activities are trackable with an occurrence timestamp and a quantifiable impact factor. For operation activities, we suggest choosing ones that can be easily tracked through various logs and traces. When choosing the types of operations, we suggest considering whether the actual data retention strategy is more relevant to file properties. Likewise, when choosing the types of outcomes, we suggest considering whether the resulting data retention strategy is more sensitive to the completion of activities performed on an HPC system or to other accomplishments that users achieve outside the purview of the HPC system. Once the operation activities and the outcome activities are selected, the system administrator can use various techniques to collect traces of the selected activities. The system administrator can either use logs or traces that are readily available in the HPC system or develop scripts or tools to trace activities automatically. Also, if the system can tolerate some inaccuracy in user activeness evaluation, manually collected activity traces can be used as well. For example, most of the sample operation activities and outcome activities listed in Table 2 can be tracked via readily available logs and traces in the HPC system, such as job submission and job completion (via job scheduler logs), and file access and dataset generation (via PFS logs and job scheduler logs). Some of them may require effort in configuring or developing monitoring tools, such as shell login, data transfer, and task completion in a workflow.
Others, such as publications resulting from job output, may be captured with a combination of automated solutions (e.g., a job-related user ID extractor and a publication database crawler) and additional manual effort (e.g., manual auditing). Please note that the major focus of our study is to propose an activeness-based data retention solution rather than an activity tracing mechanism. We provide the above discussion as a suggestion or a starting point for any system administrator who might be interested in applying our method in practice. System administrators ultimately have the right to choose the most appropriate activity types and the corresponding activity tracing methods that meet the needs of their systems.
ActiveDR promotes fruitful use of HPC storage space. We currently consider it good practice for users to simply access their files according to their inherent needs. In most cases, we suggest that users actively perform operations and/or generate outcomes and naturally benefit from the convenience ActiveDR provides. However, if users need to be aware of the file lifetime settings and to plan for backing up important data files, we suggest that system administrators provide the purge trigger interval to the users as a reference.
In our evaluation, we ran our prototype implementation as a regular job. Our prototype nevertheless demonstrates that our method can be implemented as a parallel program running on HPC systems. In actual practice, system administrators can implement their own version, which adapts to their data retention workflow and handles fault tolerance according to their system specifics.
ActiveDR is unique compared to state-of-the-art data retention strategies. Its strengths lie in its consideration of user activities, its low cost of implementation, and its practicality.

CONCLUSION
Existing data retention methodologies on HPC systems are either limited by compromised efficacy or ignore the dynamics of users' activities and hence undermine file availability for users. In this study, we rethink the data retention problem from the activeness-based perspective, which holistically captures users' activeness. We have introduced ActiveDR, an activeness-based data retention solution that is unique, effective, and reproducible, with the following characteristics: 1) user-friendly: ActiveDR places users and their activities at the core of its design. The activeness-based perspective holistically captures both the operations that users perform on the system and the outcomes that users yield by using the system. We value the user experience of the scratch space and aim to reduce file misses for active users; 2) administrator-friendly: the definition of operations and outcomes in the activeness-based perspective covers a wide range of user activities. Therefore, administrators can simply use available traces or activities captured by the monitoring tools they already use for the user activeness evaluation. They can also customize ActiveDR as needed; 3) resource-friendly: ActiveDR provides an efficient activeness evaluation algorithm that only requires a few important properties of user activities. As such, the activeness evaluation process of ActiveDR runs very fast and its memory footprint is negligible; 4) HPC-ecosystem-friendly: ActiveDR is the first data retention solution that promotes the active and fruitful use of the HPC system, which helps improve the productivity of HPC facilities.
Although our evaluation was based on user job submission and publication traces, system administrators can select other appropriate activities for user activeness evaluation. ActiveDR is designed for HPC storage systems, but its reproducibility and resource efficiency make it a valuable reference for meeting the data management needs of other shared storage systems as well. More importantly, our study offers new insights into the HPC storage management problem and can have an impact on new practices in the HPC community.