Privacy-Preserving Algorithms for Machine Learning
- Author(s): Song, Shuang
- Advisor(s): Chaudhuri, Kamalika; et al.
Modern machine learning increasingly involves personal data, such as healthcare, financial, and user behavioral information. However, models trained on such data can reveal detailed information about the underlying data, causing serious privacy breaches. Consequently, it is important to design algorithms that can analyze sensitive data while still preserving privacy.
This thesis advances the state-of-the-art of privacy-preserving machine learning in the following two major aspects.
First, this thesis addresses the challenges of differentially private large-scale machine learning. On the one hand, with a large amount of sensitive user data, privacy-preserving learning algorithms are expected to achieve improved utility. On the other hand, big data imposes additional challenges, including performance (a.k.a. Volume), data noise (a.k.a. Veracity), and a large number of classes and distributed sources (a.k.a. Variety). This thesis presents (1) private versions of the widely used stochastic gradient descent (SGD) algorithm, generalized to data from multiple sources with different privacy requirements, and (2) an improved version of the Private Aggregation of Teacher Ensembles (PATE) framework that scales to learning tasks with a large number of output classes and uncurated, imbalanced training data.
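To make the private-SGD idea concrete, the following is a minimal sketch of one generic differentially private SGD step in the standard clip-and-noise style: per-example gradients are clipped in L2 norm and Gaussian noise is added to their sum. This is an illustrative simplification, not the thesis's multi-source algorithm, and the names `clip_norm` and `noise_multiplier` are assumptions for the sketch.

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.1, rng=None):
    """One illustrative DP-SGD update: clip each gradient, sum,
    add Gaussian noise scaled to clip_norm, then average and step."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    # Gaussian noise calibrated to the clipping bound and noise multiplier.
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=w.shape)
    return w - lr * noisy_sum / len(per_example_grads)

w = np.zeros(3)
grads = [np.array([3.0, 4.0, 0.0]), np.array([0.1, 0.0, 0.0])]
w_next = dp_sgd_step(w, grads)
```

Because the noise is calibrated to the clipping bound, each step's privacy cost can be tracked and composed across iterations; the multi-source setting in the thesis generalizes this by allowing different privacy requirements per data source.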
Second, this thesis considers privacy-preserving data analysis beyond tabular data. Differential privacy is best suited to tabular data, where each record contains all the information about one individual and records are independent of one another. However, many real-world applications involve non-tabular sensitive data, such as epidemic transmission graphs and physical-activity measurements of a single subject over time. To analyze disease transmission graphs, or more generally, graphs with sensitive information stored in each node, this thesis considers privacy-preserving continual release of graph statistics, such as the percentage of highly active patients over time. The proposed algorithm outperforms the baselines in utility over a range of parameters. For physical-activity measurement, or more generally, correlated data, this thesis turns to a recent generalization of differential privacy, called Pufferfish privacy, which addresses privacy concerns in correlated data. Two mechanisms that work under different scenarios are proposed, and one of them is evaluated on real and synthetic time-series data.
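For intuition about continual release, a common baseline is to add independently calibrated Laplace noise to the statistic at every time step, splitting the privacy budget across steps. The sketch below shows that naive baseline only; it is not the improved algorithm proposed in the thesis, and the parameter names (`epsilon_per_step`, `sensitivity`) are assumptions for illustration.

```python
import numpy as np

def naive_continual_release(stats, epsilon_per_step=0.5,
                            sensitivity=1.0, rng=None):
    """Baseline continual release: perturb each time step's statistic
    (e.g., the fraction of highly active nodes) with independent
    Laplace noise of scale sensitivity / epsilon_per_step."""
    rng = rng or np.random.default_rng(1)
    scale = sensitivity / epsilon_per_step
    return [s + rng.laplace(0.0, scale) for s in stats]

# Example: a short time series of graph statistics released privately.
true_stats = [0.10, 0.15, 0.12, 0.20]
released = naive_continual_release(true_stats)
```

The weakness of this baseline is that the per-step budget shrinks as the time horizon grows, so noise accumulates; algorithms designed for continual release, such as the one in this thesis, aim for better utility over the same range of parameters.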