Collocated Data Deduplication for Virtual Machine Backup in the Cloud
Cloud platforms that host a large number of virtual machines (VMs) have high storage demand because VM snapshots are backed up frequently. Content-signature-based deduplication is necessary to eliminate redundant blocks. While dedicated backup storage systems can be used to reduce data redundancy, such an architecture is expensive and introduces heavy network traffic in a large cluster. This thesis focuses on a low-cost backup and deduplication service collocated with other cloud services to reduce infrastructure and network cost.
Previous research on cluster-based data deduplication has concentrated on various inline solutions. The first part of this thesis is a highly parallel batched solution with synchronized backup that scales to a large number of virtual machines. The key idea is to separate duplicate detection from the actual storage backup, and to partition the global index and the detection requests among machines by fingerprint value. Each machine then conducts duplicate detection independently, one partition at a time, with minimal memory consumption. Another optimization is to allocate and control the buffer space used for exchanging detection requests and duplicate summaries among machines. The proposed solution requires very little memory and disk space, while backup efficiency in terms of overall throughput and time is not compromised. Our evaluation validates this and shows satisfactory backup throughput in a large cloud setting.
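The partitioned, batched detection idea can be illustrated with a minimal single-process sketch. It is not the thesis implementation: the function names (`partition_of`, `route_requests`, `detect_partition`, `batched_detection`), the use of SHA-1 as the content signature, and the modulo-based partitioning rule are all assumptions made for illustration, and the per-machine index partitions are simulated with in-memory sets rather than on-disk structures.

```python
import hashlib
from collections import defaultdict


def fingerprint(block: bytes) -> bytes:
    # Assumption: SHA-1 stands in for the content signature.
    return hashlib.sha1(block).digest()


def partition_of(fp: bytes, num_partitions: int) -> int:
    # Hypothetical rule: map a fingerprint value to an index partition.
    return int.from_bytes(fp[:8], "big") % num_partitions


def route_requests(blocks_by_vm, num_partitions):
    # Group detection requests by the partition that owns each fingerprint.
    requests = defaultdict(list)
    for vm_id, blocks in blocks_by_vm.items():
        for idx, block in enumerate(blocks):
            fp = fingerprint(block)
            requests[partition_of(fp, num_partitions)].append((fp, vm_id, idx))
    return requests


def detect_partition(index_partition, requests):
    # Only one index partition is examined at a time, which is what
    # keeps the memory footprint of detection small.
    summary = []
    for fp, vm_id, idx in requests:
        is_dup = fp in index_partition
        if not is_dup:
            index_partition.add(fp)
        summary.append((vm_id, idx, is_dup))
    return summary


def batched_detection(blocks_by_vm, num_partitions=4):
    # Detection only: no block data is written here. A separate backup
    # step would later store the blocks reported as non-duplicates.
    index = [set() for _ in range(num_partitions)]
    requests = route_requests(blocks_by_vm, num_partitions)
    summaries = defaultdict(dict)
    for p in range(num_partitions):
        for vm_id, idx, is_dup in detect_partition(index[p], requests.get(p, [])):
            summaries[vm_id][idx] = is_dup
    return summaries


if __name__ == "__main__":
    result = batched_detection({"vm1": [b"a", b"b", b"a"], "vm2": [b"b", b"c"]})
    for vm_id, flags in result.items():
        print(vm_id, flags)
```

In a real cluster each partition would live on a different machine and the request and summary lists would travel through bounded buffers; the sketch only shows why partitioning by fingerprint value lets each partition be processed independently.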
The second part of this thesis is a VM-centric collocated backup service with inline deduplication. Compared with previous work, its key differences are improved fault resilience and low resource usage. We propose a multi-level selective deduplication scheme that integrates similarity-guided and popularity-guided duplicate elimination under stringent resource constraints. This scheme uses popular common data to facilitate fingerprint comparison, localizes deduplication as much as possible within each VM, and associates underlying file blocks with one VM in most cases. The main advantage of this scheme is that it strikes a balance between intra-VM and inter-VM deduplication, increasing parallelism and improving reliability. Our analysis shows that this VM-centric scheme can provide better fault tolerance while using a small amount of computing and storage resources. We have conducted a comparative evaluation of this scheme to assess its competitiveness in terms of deduplication efficiency and backup throughput.
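A rough sketch of how a multi-level lookup of this kind might be ordered is given below. It is an illustration under stated assumptions, not the scheme as implemented in the thesis: the class `VMBackup`, its fields, and the two-level order (the VM's own earlier blocks first, then a small shared set of popular fingerprints) are hypothetical, and similarity guidance is reduced here to reusing fingerprints from the same VM's earlier snapshots.

```python
import hashlib


def fingerprint(block: bytes) -> bytes:
    # Assumption: SHA-1 stands in for the content signature.
    return hashlib.sha1(block).digest()


class VMBackup:
    """Hypothetical per-VM backup state for a two-level dedup lookup."""

    def __init__(self, vm_id, popular_index):
        self.vm_id = vm_id
        # Small shared set of fingerprints for popular common data.
        self.popular_index = popular_index
        # Fingerprints already stored for this VM's earlier snapshots.
        self.local_index = set()
        # Blocks owned by this VM only, so a lost block affects one VM.
        self.local_store = {}

    def backup_snapshot(self, blocks):
        stats = {"local": 0, "popular": 0, "new": 0}
        recipe = []
        for block in blocks:
            fp = fingerprint(block)
            if fp in self.local_index:
                # Level 1: duplicate within the same VM.
                stats["local"] += 1
                recipe.append(("local", fp))
            elif fp in self.popular_index:
                # Level 2: duplicate of popular data shared across VMs.
                stats["popular"] += 1
                recipe.append(("popular", fp))
            else:
                # Not found: store the block in this VM's own store.
                self.local_store[fp] = block
                self.local_index.add(fp)
                stats["new"] += 1
                recipe.append(("local", fp))
        return recipe, stats


if __name__ == "__main__":
    popular = {fingerprint(b"os")}
    vm = VMBackup("vm1", popular)
    print(vm.backup_snapshot([b"os", b"x", b"x"])[1])
    print(vm.backup_snapshot([b"x", b"y"])[1])
```

Because a block that misses both levels is stored under the VM that wrote it, most blocks end up associated with a single VM, which is the property the abstract links to fault resilience.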