As organizations and companies increasingly offload data and computation to the cloud to reduce infrastructure administration,
data volumes keep growing, and new services and algorithms are needed to meet increasing demands for both storage capacity and privacy.
The first part of my thesis will address cloud data backup.
Organizations and companies often back up and archive high volumes of binary and text datasets for fault tolerance,
internal investigation, and electronic discovery. Source-side deduplication has the advantage of avoiding or minimizing the transmission of duplicate data over the network;
however, it demands more computing resources at the source for extensive fingerprint comparison, resources that would otherwise be available for primary services.
Users also need efficient, scalable services for searching the data they store in the cloud.
In the first part of this thesis, I will cover the key components of existing solutions for large-scale backup storage in the cloud.
I will go into detail on why deduplication is important to large-scale backup systems, and review some ongoing work.
I will also detail my contributions in this area towards low-profile source-side deduplication.
The second part of my thesis addresses the open problem of efficient private document search over data hosted in the cloud.
As sensitive information is increasingly centralized in the cloud, such data is often encrypted to protect privacy,
which makes effective indexing and search very challenging. To overcome the challenges of querying encrypted datasets,
searchable encryption schemes allow users to securely search over encrypted data by keyword.
However, no existing solution provides efficient ranking, which involves complex arithmetic computation in feature composition and scoring,
and without relevance ranking of search results, queries over very large datasets that return many results can be impractical.
In the second part of my thesis, I will review existing work on private search and introduce our ongoing and published work on this open problem,
focusing on how to make private search practical and scalable for large datasets.