On Distributed Learning Techniques for Machine Learning

Abstract

Modern machine and deep learning algorithms are data- and computation-hungry and benefit from large, representative datasets (e.g., ImageNet and COCO), so as much data and computational power as possible should be gathered. However, in some applications, collecting abundant examples for certain classes is impossible in practice, particularly in domains such as biology and medicine; in dermatology, for instance, some rare diseases affect only a small number of patients. It is therefore natural to ask whether we can train a model using data that is naturally dispersed among different parties (e.g., edge devices, hospitals, etc.) without explicitly sharing that data. Federated Learning (FL) is a recently proposed distributed training framework for edge computing environments that enables distributed edge devices to collaboratively train a global model under the orchestration of a central server without compromising the privacy of their data. While FL has great potential, it faces challenges in practical settings, including statistical data heterogeneity (non-IID data), personalization, fairness, computation overhead, and communication cost. I designed techniques that alleviate these challenges. Although FL is explicitly designed for non-IID edge devices, a global model can make good personalized predictions only if an edge device's context and personal data are well represented in the aggregate dataset, which is not the case for most edge devices. Most personalization techniques either fail to build a model with low generalization error or are not very effective, especially when local distributions are far from the average distribution. In addition, when the clients are edge devices, computational efficiency and communication cost become crucial bottlenecks, as such devices are typically constrained by limited compute and upload bandwidth of 1 MB/s or less.
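To make the protocol concrete, the sketch below simulates a FedAvg-style round structure in Python: each client runs a few local gradient steps on its private data, and the server averages the resulting models, weighted by local dataset size, so raw data never leaves the device. The linear-regression clients, learning rate, and shifted local distributions are illustrative assumptions, not the actual models or techniques developed in this dissertation.

```python
# Minimal FedAvg-style sketch (illustrative only, not the dissertation's method).
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.02, epochs=5):
    """Run a few epochs of full-batch gradient descent on one client's private data."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Simulate non-IID clients: each draws features from a shifted local distribution.
clients = []
for shift in (-1.0, 0.0, 2.0):
    X = rng.normal(shift, 1.0, size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=50)
    clients.append((X, y))

# Federated rounds: clients exchange only model parameters, never raw data.
w_global = np.zeros(3)
for _ in range(50):
    local_weights, sizes = [], []
    for X, y in clients:
        local_weights.append(local_update(w_global, X, y))  # local training
        sizes.append(len(y))
    # Server step: aggregate client models, weighted by local dataset size.
    w_global = np.average(local_weights, axis=0, weights=sizes)

print("global model after federated training:", np.round(w_global, 3))
```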

On the other side of the spectrum, because of redundancy, the information content of datasets is much smaller than their actual volume, despite their steady growth. Existing techniques are not effective at identifying and extracting this non-redundant information content while accounting for the intrinsic structure of the data. Hence, machine learning models are trained on massive data volumes, which requires exceptionally large and expensive computational resources. A crucial challenge of machine learning today is to develop methods that can extract representative subsets of the data and learn from those representatives accurately and robustly. My methods have immediate application to high-impact problems where massive data prohibits efficient learning and inference, such as GANs, recommender systems, graphs, video, and other large data streams. My research approach to this challenge consists of (1) extracting the information content by summarizing the most representative subsets, and (2) developing rigorous and practical techniques that provide strong guarantees for the quality of the extracted representatives and the accuracy of the learned models. My proposed methods open up new avenues for learning from representative data extracted from massive datasets.
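As one illustration of what summarizing the most representative subsets can look like, the sketch below uses greedy k-center selection to pick a small subset such that every point lies close to some chosen representative. This particular selection rule and the reported coverage radius are illustrative stand-ins under assumed synthetic data; they are not claimed to be the dissertation's algorithm or guarantees.

```python
# Illustrative greedy k-center subset selection (not the dissertation's algorithm).
import numpy as np

def greedy_k_center(X, k, seed=0):
    """Greedily pick k indices so every point is near some chosen representative."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]              # start from a random point
    dist = np.linalg.norm(X - X[selected[0]], axis=1)   # distance to nearest rep
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                      # farthest uncovered point
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected), dist

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 8))        # stand-in for a large, redundant dataset
reps, dist_to_nearest_rep = greedy_k_center(data, k=50)

# Coverage radius: distance from the worst-covered point to its nearest representative.
print(f"{len(reps)} representatives, coverage radius = {dist_to_nearest_rep.max():.3f}")
```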
