Simple, exact placement of data in containers

Storage and other systems frequently need to distribute objects equally over several sites or devices. While this is simple for a static system, organizing the distribution when additional containers (for example, hard drives or web content delivery sites) become available is difficult. We present here a very simple scheme based on the factorial number system that allows equal dynamic distribution of mirrored or replicated objects.


I. INTRODUCTION
Placing resources on servers is a perennial problem in Computer Science. It is often modeled as the placing of balls into bins. One example that arises is the allocation of replicated resources to web servers. Here, the problem is not only the efficient use of server storage space, but also quality of service guarantees and a fit to the geographical distribution of demand. Another important example is storage virtualization, where we want to give the impression of an on-line, practically unlimited storage space. A storage center will have many different disks, often of different capacity and possibly with rather different access times.
We propose a general solution to the specific problem of how to evenly distribute k replicas of N objects among n bins such that (a) each bin receives the same number of objects, (b) two replicas never share the same bin, and (c) the reconstruction load is distributed equally, by which we mean that if a bin fails and is replaced by a new bin, we can regenerate the replicas of the objects on the failed bin from the other bins in such a way that each bin receives the same amount of load. Since we do not know a priori the objects we will store and we allow insertion and deletion of objects, these properties refer to expected values.
Our solution is based on the basic properties of the factorial number system. Moving from an ordinary (binary, decimal, or hexadecimal) number system to the factorial number system is as simple and as fast as changing among ordinary number systems. As we will see, the factorial number system presents a very simple way to randomly distribute objects among a given number of bins. We later expand our solution (at the cost of additional calculations) to containers of arbitrary sizes.
A first application of our technique is the distribution of resources to cloud data centers. We rent the same amount of storage space in different data centers and may want to be able to acquire more storage space not only by extending the space we rent in the centers that we already use, but also by renting space in new centers. A second application is the storage of metadata in the distributed RAM of a multicomputer. Here we can assume that all machines have the same amount of memory. A third example is extensible hashing, where we want the buckets to have the same size. In the first example, replication is critical for data durability; in all three, it is essential for preventing temporal bottlenecks due to the popularity of specific items.
The remainder of this paper is organized as follows: Section II reviews previous work on resource placement; Section III introduces the factorial number system; Section IV introduces our technique; Section V expands it to the case of containers with arbitrary sizes; finally, Section VI has our conclusions.

II. RELATED WORK
Litwin et al. extended linear hashing to resource placement in a variable number of servers [10]. The goal of LH* and its variants is constant time access to objects in a distributed system, even when clients have little and possibly inaccurate knowledge of the state of the dynamic system. LH* schemes place objects into buckets but are not concerned with achieving equitable distribution, nor do they take the storage capacity of the system nodes into account. LH* splits a predetermined bucket whenever it detects a bucket overflow, but not necessarily the overflowing bucket. If the total number of nodes used is not a power of two, then there are two classes of buckets, with the nodes in one class having twice the expected number of buckets as the other.
Fagin [5] proposed extensible hashing with similar goals. The difference is that in extensible hashing, any overflowing hash bucket can split, whereas the distributed version of linear hashing splits buckets in a fixed order. Both can be helpful in devising placement schemes. Both implement Scalable Distributed Data Structures (SDDS) [9], which allow files to expand to new servers gracefully without central information that limit access primitives such as search and insertion to only update a single client.
Plaxton et al. [12] addressed the problem of dynamic placement of replicated objects in the context of the web. They propose a hashed-suffix routing structure. Karger et al. [7] propose consistent hashing to improve on Plaxton's result. While their scheme yields a faithful distribution, it does not allow for replica placement. Chen et al. [4] built on Plaxton's work to choose the number and placement of replicas while satisfying QoS requirements and server capacity constraints.
In the context of storage virtualization, Brinkmann et al. [2], [3] presented several schemes that present an excellent compromise among various goals (see below), but that are not absolutely faithful, i.e. are not expected to distribute objects equally among containers. Schindelhauer similarly improves on consistent hashing, allowing storage containers of different capacities [13]. Based on their target application, Brinkmann and colleagues propose this list for an ideal placement scheme:
1) Faithfulness: The expected number of resources placed in a bin is between $(1-\epsilon)\, d_i \cdot m$ and $(1+\epsilon)\, d_i \cdot m$ for all containers i, where $\epsilon$ can be made arbitrarily small.
2) Time Efficiency: The scheme can calculate the position of a resource efficiently.
3) Compactness: The amount of information needed to calculate is small. In particular, it should only depend on the number of containers and on the number of resources in a logarithmic way.
4) Adaptivity: After a change in the number of containers, the number of resources, or the storage capacities, the distribution of the resources over the containers can quickly adapt to recover faithfulness. A measure of success in achieving this goal is competitiveness. A placement strategy is called c-competitive if at most c times the number of resources are moved than in an optimal adaptive and perfectly faithful strategy.
5) Obliviousness: The placement of resources into containers only depends on the resource identifiers and the number and sizes of containers, not on the history of the system.
Honicky and Miller [6] proposed RUSH to place objects in an object-based storage system. Their main insight is the nature of updates in a storage system, since new storage containers are added in clusters. In practice, even a highly dynamic system will not undergo a great number of additions. An improved version, CRUSH, removes issues that make RUSH an insufficient scheme in practice [14].
In comparison to previous work, we propose a simpler (and mathematically more elegant) scheme that is absolutely faithful, time efficient, very compact, and oblivious, but not always optimally adaptive unless we add or remove one container from the ensemble.

III. FACTORIAL REPRESENTATION
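The factorial number system [8] is a system with the mixed radices 1!, 2!, 3!, 4!, .... We denote the expression of a number x in this system with $x_f$. If a number x is given in this system as the digit sequence $a_1, a_2, \ldots, a_n$ with $0 \le a_\nu \le \nu$, then

$$x = \sum_{\nu=1}^{n} a_\nu \, \nu!.$$

Because each digit is bounded by its index, the digits below position $\mu$ satisfy the inequality

$$\sum_{\nu=1}^{\mu-1} a_\nu \, \nu! \le \sum_{\nu=1}^{\mu-1} \nu \cdot \nu! = \mu! - 1 < \mu!.$$

To see that the representation is unique, assume that two different digit sequences $(a_\nu)$ and $(b_\nu)$ represent the same number, and let $\mu$ be the largest index at which they differ. The difference of the two represented numbers is

$$(b_\mu - a_\mu)\,\mu! + \sum_{\nu=1}^{\mu-1} (b_\nu - a_\nu)\,\nu!.$$

According to the inequality, the second addend $\sum_{\nu=1}^{\mu-1} (b_\nu - a_\nu)\,\nu!$ has an absolute value strictly smaller than $\mu!$. However, $b_\mu - a_\mu$ is not zero and, as an integer, has absolute value at least one. Therefore, the first addend $(b_\mu - a_\mu)\,\mu!$ has absolute value at least $\mu!$. If both representations represent the same number, then of course the sum should be zero, which is impossible. This contradiction proves the uniqueness of representation.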
The calculation of the factorial representation can be done most easily by successive divisions with remainder by 2, 3, .... As an example, we take 1000. The least significant digit of the factorial representation results from dividing 1000 by 2 with remainder, yielding 1000 = 2 · 500 + 0. Proceeding, we divide by 3, giving us 500 = 3 · 166 + 2. The next step is division by 4, giving us 166 = 4 · 41 + 2, then division by 5, giving us 41 = 5 · 8 + 1, by 6, giving us 8 = 6 · 1 + 2, and finally by 7, giving us 1 = 7 · 0 + 1. We assemble the digits in ascending order in an array [0, 2, 2, 1, 2, 1]. We verify that

$$0 \cdot 1! + 2 \cdot 2! + 2 \cdot 3! + 1 \cdot 4! + 2 \cdot 5! + 1 \cdot 6! = 0 + 4 + 12 + 24 + 240 + 720 = 1000.$$

As Figure 1 shows, computing the factorial representation of any positive integer uses the same iterative approach as the algorithms for computing decimal and hexadecimal representations of numbers. There are two differences: first, our divisor changes at each iteration step; second, our result starts with the least significant digit, contrary to the custom for most other representations of numbers.
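The successive-division procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' listing from Figure 1; the function names `factorial_digits` and `from_factorial_digits` are chosen here for illustration only.

```python
def factorial_digits(x: int) -> list:
    """Return the factorial-base digits of x, least significant digit first.

    The digit at list position j (0-based) is the remainder of the division
    by j + 2, so it has place value (j + 1)! and lies in 0..j + 1.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    digits = []
    divisor = 2
    while x > 0:
        x, remainder = divmod(x, divisor)  # divisor grows at every step
        digits.append(remainder)
        divisor += 1
    return digits


def from_factorial_digits(digits: list) -> int:
    """Inverse of factorial_digits: sum of digit * place value."""
    value, weight = 0, 1
    for position, digit in enumerate(digits, start=1):
        weight *= position          # weight is now position!
        value += digit * weight
    return value


print(factorial_digits(1000))                       # [0, 2, 2, 1, 2, 1]
print(from_factorial_digits([0, 2, 2, 1, 2, 1]))    # 1000
```

The round trip through `from_factorial_digits` repeats the verification of the worked example for 1000.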

IV. ALGORITHM
We have to distribute k replicas among N containers. We treat first the case of a single replica (k = 1), since it is easier to understand. We then generalize to an arbitrary number of replicas.

A. Single Replica
We describe the placement using a system that starts with one container, bin 0, and successively adds other containers of equal size, bin 1, bin 2, .... The number of containers is the state of the system. The location of a resource is a function that only depends on the state and the Resource Identifier (RID) of the resource to be stored. We assume that RIDs behave like random numbers. For example, they could be SHA-1 or even MD5 hashes of the unique name of the resources.
In State 1, we have only one container and all resources are located in bin 0. If we add a second container, we enter State 2 and will have to move (about) half of the resources to bin 1 in order to rebalance the load. To decide which resources should be moved, we consult the first digit $x_1$ of the factorial representation of all RIDs. As we obtained this digit by computing the RID modulo 2, its sole possible values are zero and one. If the value is 0, then we move the resource to bin 1, otherwise we leave it in bin 0.
Adding a third container brings us to State 3. We need to move (about) one third each of the objects in bin 0 and bin 1 to bin 2. We select them based on the second digit $x_2$ of the factorial representation of their RID, which can only be equal to 0, 1, or 2. As long as RIDs behave as random numbers (which would be the case if they were hashes of the resource names), each of these values will occur with equal probability and independently of the previous digit. If $x_2$ equals 0, then we move the resource to bin 2.
Should a fourth container become available, we would enter State 4. Since the third digit $x_3$ of the factorial representation of each RID can only be equal to 0, 1, 2, or 3, moving to the new bin all objects whose digit $x_3$ is equal to 0 would move about one fourth of the contents of bins 0 to 2 into the new bin 3.
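In general, in a system with N containers, we assign an object with RID x to bin i if the index of the last occurrence of 0 in the list $x_0, x_1, x_2, \ldots, x_{N-1}$ is equal to i. Here we set $x_0 = 0$, so that an object whose other digits are all non-zero stays in bin 0.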
By induction, we conclude that the expected values of the number of resources in each container are equal. We also note that each addition of a container leads to the minimum number of movements necessary in order to balance the load, as we only move to the new container, but not among the old containers. When we lower the number of containers, we undo the previous extensions, as the assignment of resources to containers depends only on state and RID. We give the algorithm as pseudo-code in Figure 2.
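The single-replica rule can be sketched as follows. This is a minimal illustration rather than the listing of Figure 2; the name `place_single` is hypothetical, and the sketch assumes the indexing used in this subsection, in which digit $x_i$ is the remainder of the i-th division (by i + 1) and therefore ranges over 0..i.

```python
def place_single(rid: int, nr_bin: int) -> int:
    """Return the bin holding the resource when nr_bin containers exist."""
    if nr_bin < 1:
        raise ValueError("need at least one container")
    bin_index = 0                       # State 1: everything is in bin 0
    for new_bin in range(1, nr_bin):
        # Peel off digit x_new_bin, which lies in 0..new_bin.
        rid, digit = divmod(rid, new_bin + 1)
        if digit == 0:
            bin_index = new_bin         # the last zero digit decides
    return bin_index


# RID 1000 has digits [0, 2, 2, 1, 2, 1]: only x_1 is zero.
print(place_single(1000, 1))   # 0
print(place_single(1000, 7))   # 1
```

Enumerating all RIDs from 0 to 4! - 1 with four containers puts exactly six resources in each bin, and growing from three to four containers only ever moves a resource into bin 3, which matches the minimal-movement argument above.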

B. Several Replicas
We now treat the case of k replicas. Each replica has a replica number r (rep in Figure 3) from 0 to k − 1. We never want to store replicas of the same object in the same container and therefore need at least k containers to start. If we are in this situation, we assign the replica with replica number r to bin r. Clearly, in this initial state, all containers have the same number of things to store, namely one replica per object.
If an additional container becomes available, we need to either let the replicas of a certain object stay in their current location or move one of the replicas into the new bin. In order to balance the load of the containers, we need to move 1/(k + 1) of the contents of each old container to the new container. Thus, for a given object, we should not move any of the replicas with probability 1/(k + 1) and we should move exactly one replica, but not more, with probability k/(k + 1). In the latter case, we need to pick the replica to be moved with equal probability. We use the digit $x_k$ of the factorial representation of the RID of the object to make the decision. This digit corresponds to the radix k! and has a value in {0, ..., k}. If this digit has a value $x_k = k$, then we do not move any replica. Otherwise, we move the replica with replica number $x_k$.

Now assume an additional container, bin k + 1, becomes available. Again, we need to move an equal proportion, in this case 1/(k + 2), of the contents of the old containers to the new container. We use the digit $x_{k+1}$ of the factorial representation of the RID of a resource. If $x_{k+1} \ge k$, then we leave the replicas of the resource where they currently are. Otherwise, we move the replica with replica number $x_{k+1}$ to the new container, bin k + 1.
In general, when we introduce bin l, we consult the digit $x_l$ in the factorial representation of the RID of the resource. If the digit is smaller than k, we move the replica with replica number $x_l$ to the new bin; otherwise, we do not move any replicas. As at most one replica is moved to a certain container when that container becomes available, and as initially replicas of the same object are in different containers, our algorithm never places two replicas of the same object into the same bin.
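The general rule can be sketched as below. This is an illustration under stated assumptions, not the listing of Figure 3: the function name `place_replica` is hypothetical (the parameter names follow the caption of Figure 3), and digit $x_l$ is taken to be the remainder of the division by l + 1, so that it ranges over 0..l as stated above. Under this indexing, the digit consulted for bin l is the entry at position l − 1 of a little-endian digit list that starts with the remainder modulo 2.

```python
def place_replica(rid: int, rep: int, nr_bin: int, nr_rep: int) -> int:
    """Return the bin holding replica rep when nr_bin containers exist."""
    if not 0 <= rep < nr_rep:
        raise ValueError("replica number out of range")
    if nr_bin < nr_rep:
        raise ValueError("need at least as many containers as replicas")
    # Digits x_1 .. x_{nr_rep - 1} are not consulted; discard them.
    for radix in range(2, nr_rep + 1):
        rid //= radix
    bin_index = rep                     # initial state: replica r in bin r
    for new_bin in range(nr_rep, nr_bin):
        # Digit x_new_bin lies in 0..new_bin; values >= nr_rep mean "no move".
        rid, digit = divmod(rid, new_bin + 1)
        if digit == rep:
            bin_index = new_bin         # this replica moves to the new bin
    return bin_index


# Exhaustive check for k = 3 replicas and N = 5 containers:
# RIDs 0 .. 5! - 1 cover every combination of the digits x_1 .. x_4 once.
counts = [0] * 5
for rid in range(120):
    bins = [place_replica(rid, r, 5, 3) for r in range(3)]
    assert len(set(bins)) == 3          # replicas never share a bin
    for b in bins:
        counts[b] += 1
print(counts)                           # [72, 72, 72, 72, 72]
```

Because a digit can equal at most one replica number, at most one replica of an object moves per extension, which is exactly the argument used above to show that two replicas never share a bin.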
As an example, consider an object with RID 12345678910. Its factorial representation is [0, 2, 3, 2, 1, 3, 3, 3, 1, 3, 9, 12, 1]. (These digits form the upper row of numbers in Figure 4.) We assume that we place three replicas, starting out with 3 containers. We place the replicas 0, 1, and 2 in bins 0, 1, and 2, respectively (Figure 4, upper row). When we add one container, bin 3, we use the fourth-least significant digit of the factorial representation, namely 2. Since this is a replica number, we place replica 2 in the new bin (Figure 4, second row). The next digit, 1, corresponds to introducing the fifth container. It leads us to place replica 1 into bin 4. The three digits 3 that follow mean that for this object, no replica changes container for the next three extensions. However, when we introduce bin 8, we have 1 as the corresponding digit, and we switch that replica to bin 8. The last row of Figure 4 shows the distribution of replicas with eleven containers.
The average number of objects in each container is equal, as we now argue. If we have k replicas and k containers, then each container contains the same number of objects. For an inductive step, we assume that our algorithm distributes the objects evenly over N containers and that we add a new container to the ensemble. If the new container is bin N + 1, we see that it contains the first, second, third, etc. replica in equal proportions. In the general case, the new container will contain objects whose RID has an N-th factorial digit in the set {0, 1, ..., k − 1}. This places a replica of k/(N + 1) of all objects into the new container. As there are k replicas per object, 1/(N + 1) of the contents of the system is placed in the new container. Furthermore, the proportion of first, second, third, etc. replicas of objects is still the same. One of the first k containers (bins 0, 1, ..., k − 1) loses an object to the new container with probability 1/(N + 1), moving 1/N · 1/(N + 1) of all contents, and reducing its share of total contents from 1/N to 1/(N + 1). Another previous container (bins k, ..., N − 1) loses a first replica (with replica number 0) to the new container with probability 1/(N + 1), and with the same probability a second, third, etc. replica. Consequently, it loses 1/(N + 1) of its contents, amounting to 1/N · 1/(N + 1) of total contents. We have shown by induction that each container contains (in expectation) the same number of objects. Additionally, we have shown that each added container (bins k, k + 1, ...) contains replicas in the same proportion. A specific replica with replica number s will be located with probability k/N in bin s and with probability 1/N in each of the bins k, k + 1, ..., N − 1 (Figure 5).
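The arithmetic of the inductive step can be written out explicitly. In the following worked equations, M denotes the number of stored objects; this symbol is introduced here only for illustration, so that the system holds kM replicas in total.

```latex
\begin{align*}
\text{replicas placed in the new bin}
  &= \frac{k}{N+1}\, M \;=\; \frac{1}{N+1}\,(kM), \\[4pt]
\text{new share of an old bin}
  &= \frac{1}{N}\left(1 - \frac{1}{N+1}\right)
   \;=\; \frac{1}{N}\cdot\frac{N}{N+1}
   \;=\; \frac{1}{N+1}, \\[4pt]
\text{total share after the extension}
  &= N \cdot \frac{1}{N+1} + \frac{1}{N+1} \;=\; 1 .
\end{align*}
```

The last line is only a consistency check: the N old bins and the new bin together still account for all replicas.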
The amount of data movement during an expansion from N − 1 to N containers is expected to be 1/N of the contents of each old container, all of it moving to the new container. This is optimal.
Assume that the system requests contents by randomly selecting a replica number and then asks for the replica at the container storing it, and assume now that a container is inaccessible. If our strategy selects replica r for a certain object and that copy is located on the failed bin, and if we then select another replica number s, then the request will go with probability k/N to bin s and with probability 1/N to one of bin k, bin k + 1, ..., bin N − 1. Thus, if we select the second replica number at random, the request will be served by a certain bin with equal probability 1/(N − 1). In other words, our system divides load equally in the presence of an inaccessible server.
If we want to lower the total number of containers, we remove the last one, undoing the last expansion. This gives optimal reallocation traffic, where each remaining bin receives traffic worth 1/(N − 1) of the total size of the system.
The solution is more involved if we are forced to remove an arbitrary container. Assume that container bin i has to be removed. We rename the last added container, bin N − 1, to bin i. A portion of 1/(N − 1) of the contents of the former bin N − 1 came from bin i and stays there. The former bin N − 1 sends (N − 2)/(N − 1) of its contents to the other containers in equal shares. As the new bin i, it needs to recoup the other (N − 2)/(N − 1) of the contents of the old bin i, namely those that were not sent to the then new bin N − 1 in the last extension. Of the total contents, we need to ship $\frac{N-2}{N(N-1)}$ from the old bin N − 1 to other containers and we need to ship $\frac{1}{N}$ from the other containers to the old bin N − 1, which has become the new bin i. This means we move $\frac{2N-3}{N(N-1)}$ of the contents. While not optimal by a factor of $2 - \frac{3}{N}$, at least the work is equally distributed over the remaining containers due to our previous observation.
In order to use our scheme, our resource identifiers need to yield all possible values for the digits in the factorial representation we use. Since the digits correspond to the containers, this gives a relationship between the maximum number of bins and the number of binary digits in the RID (Table I). If RIDs are hash values of unique identifiers, then the maximum number of containers is relatively small, but sufficient for the purposes where we propose our scheme instead of the more involved versions in the literature. One simple possibility to overcome the size limitation is to use the RID in order to seed a good, but cheap random number generator, such as a Mersenne twister or a simple congruential random number generator with a Bays-Durham shuffle [1], [11], in order to generate more digits pseudo-randomly.
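The counting argument behind this limit can be sketched as follows. The sketch rests on an assumption made here for illustration: a system with n bins consults the digits $x_1, \ldots, x_{n-1}$, whose combinations number n!, so a b-bit RID is taken to support n bins as long as n! does not exceed $2^b$. The function name `max_bins` is hypothetical, and the output is not claimed to reproduce the entries of Table I.

```python
def max_bins(rid_bits: int) -> int:
    """Largest n with n! <= 2**rid_bits (assumed criterion, see lead-in)."""
    if rid_bits < 0:
        raise ValueError("rid_bits must be non-negative")
    limit = 1 << rid_bits
    n, fact = 1, 1                  # invariant: fact == n!
    while fact * (n + 1) <= limit:
        n += 1
        fact *= n
    return n


for bits in (32, 64, 128, 160):
    print(bits, max_bins(bits))
# 32 12
# 64 20
# 128 34
# 160 40
```

Under this criterion a 128-bit hash such as MD5 would cover 34 bins and a 160-bit hash such as SHA-1 would cover 40, which illustrates why the text calls the limit relatively small.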

V. EXTENSION TO ARBITRARY SIZES
We use the factorial representation of RIDs as a way to encode the decision when to move a resource to a new container. We can extend this method (at the cost of additional calculations) to containers of arbitrary sizes. Assume that the sizes are $s_0, s_1, s_2, \ldots$. We mimic the calculation of the factorial digit by using hash functions. We observe that we need to move a replica of an object from the existing containers to a new container, bin N, with probability

$$p_N = \frac{k \, s_N}{\sum_{i=0}^{N} s_i},$$

which must not exceed 1. If this is the case, then our scheme and its properties hold. We assume that we have k replicas of each object and that the first k containers (bins 0 to k − 1) have equal size.
To organize the moves, we associate the RID of an object to a decision list L as follows. We use the RID as the seed of a random number generator and assume that the generator yields the random number sequence $(r_k, r_{k+1}, \ldots)$ with values in the unit interval. We partition the unit interval [0, 1] into two parts $[0, p_k)$ and $[p_k, 1]$ and the left part into k equally sized subparts, giving us a partition (decomposition into mutually non-intersecting subsets)

$$[0, 1] = \left[0, \tfrac{p_k}{k}\right) \cup \left[\tfrac{p_k}{k}, \tfrac{2 p_k}{k}\right) \cup \cdots \cup \left[\tfrac{(k-1)\, p_k}{k}, p_k\right) \cup [p_k, 1].$$

If $r_k$ falls into the first interval, we put 0 into L; if it falls into the second interval, we put 1, and so on; and if it falls into the last interval, we put k. We call this the quasi-digit $x_k$. We similarly proceed with k + 1, generating a list of quasi-digits $L = [x_k, x_{k+1}, \ldots]$.
We then apply our scheme using the quasi-digits in L instead of the digits of the factorial representation. If we have N containers, we place the replica r of an object with RID R into the container b if the maximum index of the digit r in the list L generated by R up to and including index N − k is b, or, if the quasi-digit r does not appear, the replica is stored in bin r. The same analysis as before applies.
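The quasi-digit construction can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the paper: the names `quasi_digit` and `place_replica_sized`, the use of Python's `random.Random` (a Mersenne twister) seeded with the RID, and the move probability k · s_N / (s_0 + ... + s_N), which reduces to k/(N + 1) for equal sizes.

```python
import random


def quasi_digit(u: float, p: float, nr_rep: int) -> int:
    """Map u in [0, 1) to a replica number 0..nr_rep-1, or nr_rep for no move."""
    if u >= p:
        return nr_rep                       # last interval [p, 1]
    # [0, p) is cut into nr_rep equal subintervals; min() guards rounding.
    return min(int(u * nr_rep / p), nr_rep - 1)


def place_replica_sized(rid: int, rep: int, sizes: list, nr_rep: int) -> int:
    """Return the bin of replica rep for containers with the given sizes."""
    if not 0 <= rep < nr_rep:
        raise ValueError("replica number out of range")
    if len(sizes) < nr_rep:
        raise ValueError("need at least as many containers as replicas")
    rng = random.Random(rid)                # the RID seeds the generator
    total = sum(sizes[:nr_rep])             # first nr_rep bins: equal size
    bin_index = rep
    for new_bin in range(nr_rep, len(sizes)):
        total += sizes[new_bin]
        p = nr_rep * sizes[new_bin] / total  # assumed move probability
        if p > 1:
            raise ValueError("new container too large for this scheme")
        # One random number per extension, identical for every replica.
        if quasi_digit(rng.random(), p, nr_rep) == rep:
            bin_index = new_bin
    return bin_index


sizes = [2, 2, 2, 1, 3]                     # hypothetical container sizes
print([place_replica_sized(12345678910, r, sizes, 3) for r in range(3)])
```

Because the generator is re-seeded with the same RID on every call, all replicas of an object see the same quasi-digit list, so at most one of them moves per extension and no two replicas share a bin.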

VI. CONCLUSIONS
We have presented a general solution to the problem of evenly distributing k replicas of N objects among n bins of equal size. Our solution assumes that all objects are identified by a random resource ID and makes all its allocation decisions based on the factorial representation of that resource ID. Its outcome is a replica-to-bin mapping such that (a) each bin receives the same number of objects, (b) two replicas never share the same bin, and (c) the task of reconstituting the contents of a lost bin is equally distributed among the surviving bins. In addition, our technique gracefully handles any change in the number of bins. Since computing the factorial representation of an integer is a fairly simple operation, there is no need to store these representations anywhere.
In addition, we have presented two extensions of our method. The first extension eliminates any restriction on the maximum number of bins N. The second applies to bins of arbitrary size.
Supported in part by Grant CCF-1219163, by the Department of Energy under Award Number DE-FC02-10ER26017/DE-SC0005417, and by the industrial members of the Storage Systems Research Center.

Fig. 1. Algorithm to return the factorial representation of a non-negative number as a list in little-endian (least significant digit first) order.

Fig. 2. Algorithm to calculate placement of a resource with RID and nrBin containers.

Fig. 3. Algorithm to calculate placement of replicas. The parameters are RID, the identity of the resource; rep, the identity of the replica; nrBin, the number of containers; and nrRep, the total number of replicas.

Fig. 4. Replica distribution for RID 12345678910 for three to eleven containers.

Fig. 5. Distribution of the s-th replica among N bins.

TABLE I. RELATIONSHIP BETWEEN MAXIMUM NUMBER OF BINS AND LENGTH (NUMBER OF BINARY DIGITS) OF THE RECORD IDENTIFIER (RID).