New Outlier Removal and Machine-Learning-Augmented Algorithms
This dissertation largely studies problems of two types. In the first part, we study ranking and clustering in the presence of outliers. The goal is to remove up to a specified number of elements and then find a more meaningful clustering or ranking on the remaining ones.
- Rank Aggregation with Outliers. In rank aggregation from pairwise comparisons, the objective is to order n given elements by aggregating pairwise comparison information that may be inconsistent. Inconsistencies caused by outliers can heavily distort the final ranking. We initiate the study of rank aggregation with outliers, in which we are allowed to discard a small fraction of the elements as outliers in order to obtain a better ranking.
- k-means Clustering with Outliers. In standard k-means clustering, we are given a set of n points in Euclidean space (or any metric space) and a target number of clusters k; the goal is to choose a set of k points from the space as centers so as to minimize the sum of squared distances from every point to its closest center. Since k-means clustering is highly sensitive to noise, we study it in the presence of outliers by allowing the algorithm to discard a small fraction of the points. Previous algorithms for k-means clustering with outliers are either complicated, lack provable guarantees, suffer from high running times, or are not applicable in distributed settings. The algorithms we develop address all of these challenges simultaneously.
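As a concrete illustration of the objective in the k-means bullet above, the following sketch evaluates the k-means-with-outliers cost of a fixed set of centers; the point set, centers, and function name are hypothetical, and this only evaluates the objective rather than implementing any of the dissertation's algorithms.

```python
def cost_with_outliers(points, centers, z):
    """k-means-with-outliers objective: sum of squared distances to the
    nearest center, after discarding the z farthest points as outliers."""
    dists = sorted(min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
                   for px, py in points)
    return sum(dists[:len(dists) - z])  # drop the z largest contributions

# Two tight clusters plus one far-away noise point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (100, 100)]
centers = [(0.33, 0.33), (10, 10.5)]
print(cost_with_outliers(points, centers, z=0))  # dominated by the noise point
print(cost_with_outliers(points, centers, z=1))  # noise discarded: cost < 2
```

The example makes the sensitivity to noise concrete: a single distant point dominates the standard objective, while allowing one outlier to be discarded restores a small, meaningful cost.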
In the second part of this dissertation, we augment traditional algorithm design with machine-learned predictions for online problems, specifically for extensions of two classical online problems: non-clairvoyant job scheduling and knapsack. Algorithms developed in this model benefit from a prediction supplied by an estimator. The goal is to obtain better results when the prediction is accurate, while guaranteeing results not much worse than those of traditional algorithms when the prediction is error-prone.
- Non-Clairvoyant Scheduling with Predictions. In the most basic version of non-clairvoyant scheduling, we have a set of jobs that must be scheduled on a single machine with the goal of minimizing the total completion time of all jobs; equivalently, this measures the total waiting time of all jobs. The job sizes are unknown to the algorithm and become known only after the jobs have completed. A high-quality error measure is of central importance, as it guides the development of algorithms. We propose a new error measure that leads to better algorithms.
- Online Knapsack with Frequency Predictions. In the classic online knapsack problem, we have a knapsack of a given capacity and n items that arrive online. Each item i has a profit and a size. When an item arrives, an online algorithm must make an irrevocable choice of whether to accept or reject it. The goal is to select a subset of items of maximum total profit whose total size is at most the knapsack capacity. We consider a weak prediction model in which we are given, as the prediction, a range in which the total size of items of each value per unit lies. We show how even such a weak prediction can be utilized effectively to provably improve the performance of online algorithms. We also extend the results to more general settings, such as generalized one-way trading and two-stage online knapsack.
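To make the scheduling setting in the first bullet above concrete, round robin, which shares the machine equally among all alive jobs, is the classic non-clairvoyant strategy: it never consults job sizes. The sketch below (function names and the job instance are illustrative, not the dissertation's algorithms) computes its total completion time in closed form and compares it against the clairvoyant shortest-processing-time optimum.

```python
def total_completion_time_rr(sizes):
    """Idealized round robin (processor sharing): every alive job runs at
    an equal rate, so job sizes are never consulted -- non-clairvoyant."""
    p = sorted(sizes)                # jobs finish in increasing order of size
    n, t, total, prev = len(p), 0.0, 0.0, 0.0
    for j, pj in enumerate(p):
        t += (n - j) * (pj - prev)   # time until the next job finishes
        total += t
        prev = pj
    return total

def total_completion_time_spt(sizes):
    """Clairvoyant optimum: run jobs in shortest-processing-time order."""
    t = total = 0.0
    for pj in sorted(sizes):
        t += pj
        total += t
    return total

sizes = [1, 2, 4]                        # hypothetical job sizes
print(total_completion_time_rr(sizes))   # 15.0
print(total_completion_time_spt(sizes))  # 11.0
```

On this instance round robin pays 15 against the optimum of 11, illustrating the gap that an accurate size prediction could help close while a robust algorithm must still bound the damage of a bad prediction.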
None of the above problems can be solved optimally, either because the complete input is not available in advance or because the problem is computationally hard. Nevertheless, we seek to design and analyze algorithms with rigorous guarantees close to the optimal solution.