Fast and efficient failure recovery is a new challenge for cloud storage
systems with a large number of storage nodes. A pivotal recovery metric upon
the failure of a storage node is the repair bandwidth cost, which refers to the
amount of data that must be downloaded to regenerate the lost data. Since not
all surviving nodes are always accessible, we introduce a class of maximum
distance separable (MDS) codes that can be reused when the number of selected
nodes varies, yet still yield close-to-optimal repair bandwidth.
Such codes provide flexibility in engaging more surviving nodes to reduce the
repair bandwidth without redesigning the code structure or changing the
content of the existing nodes. We call this property of MDS codes
progressive engagement. The name comes from the fact that, when a failure
occurs, we show that the best strategy is to incrementally engage the surviving
nodes according to their accessing cost (delay, number of hops, traffic load, or
availability in general) until the repair-bandwidth or accessing-cost
constraints are met. We argue that the existing MDS codes fail to satisfy the
progressive engagement property. We subsequently present a search algorithm to
find a new set of codes, named rotation codes, that have both the progressive
engagement and MDS properties. Furthermore, we illustrate how the existing
permutation codes can provide progressive engagement once their original
recovery scheme is modified. Simulation results compare the repair bandwidth of
these codes as the number of participating nodes varies, as well as their speed
of single-failure recovery.
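
As an illustration only, the incremental engagement strategy described above
can be sketched as a greedy loop. This is not the recovery scheme of rotation
or permutation codes. It assumes a hypothetical map `access_costs` from
surviving node to accessing cost, and it uses the well-known cut-set bound for
minimum-storage regenerating codes, M*d / (k*(d - k + 1)), as a stand-in for
the repair bandwidth obtained with d helper nodes. The function names and the
`bandwidth_target` parameter are likewise assumptions made for this sketch.

```python
from fractions import Fraction


def msr_repair_bandwidth(file_size, k, d):
    """Cut-set lower bound on repair bandwidth with d helpers (stand-in model)."""
    if d < k:
        raise ValueError("at least k helper nodes are required")
    return Fraction(file_size * d, k * (d - k + 1))


def progressive_engagement(access_costs, file_size, k, bandwidth_target):
    """Engage surviving nodes in increasing order of accessing cost.

    Starts from the k cheapest nodes and adds one node at a time until the
    modeled repair bandwidth drops to bandwidth_target or no nodes remain.
    Returns (engaged_nodes, repair_bandwidth).
    """
    order = sorted(access_costs, key=access_costs.get)  # cheapest first
    if len(order) < k:
        raise ValueError("fewer than k surviving nodes are accessible")
    for d in range(k, len(order) + 1):
        bandwidth = msr_repair_bandwidth(file_size, k, d)
        if bandwidth <= bandwidth_target:
            return order[:d], bandwidth
    # Target not reachable: every surviving node has been engaged.
    return order, bandwidth


if __name__ == "__main__":
    # Hypothetical accessing costs (for example, hop counts) of three survivors.
    costs = {"n1": 5, "n2": 1, "n3": 3}
    print(progressive_engagement(costs, file_size=4, k=2, bandwidth_target=3))
```

In this toy model, engaging a third node lowers the modeled repair bandwidth
relative to using only the two cheapest nodes, which is the trade-off between
accessing cost and repair bandwidth that progressive engagement exploits. An
accessing-cost budget could be added as a second stopping condition in the
same loop.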