Learning from Human Feedback: Ranking, Bandit, and Preference Optimization

Abstract

This dissertation investigates several challenges in artificial intelligence (AI) alignment and reinforcement learning (RL), focusing in particular on applications where only preference feedback is available. Learning from preference feedback has long been a central problem across fields such as ranking, recommendation systems, and social choice theory. Recently, reinforcement learning from human feedback (RLHF) has also demonstrated strong potential for leveraging weakly supervised human data (preference feedback) and for accurately encoding human values into machine learning models. This dissertation aims to comprehensively characterize preference-based statistical learning, focusing on the sample complexity of ranking and preference model estimation as well as the fine-tuning of large language models.

The first part of the dissertation explores novel methods for the learning-to-rank problem. I studied learning to rank under the strong stochastic transitivity (SST) condition, a prevalent model that does not assume a numerical score for each option. SST assumes that the accuracy of a comparison between two items increases as the disparity in their quality widens. I proposed one of the first adaptive approaches that can effectively aggregate feedback from different human labelers, and illustrated how the relationship between the number of human queries and the resulting performance depends on the properties of the labelers. With my collaborators, I then followed up in this direction and developed active ranking algorithms that work without strong stochastic transitivity. Under this practical yet harder setting, our efficient algorithm requires fewer human queries than algorithms designed under stronger assumptions, and it is provably optimal.
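As a point of reference, a standard formulation of the SST condition reads as follows; the notation is introduced here for illustration and is not quoted from the dissertation:

\[
\text{if } i \succ j \succ k \text{ in the underlying order, then } p_{i,k} \;\ge\; \max\{p_{i,j},\, p_{j,k}\} \;\ge\; \tfrac{1}{2},
\]

where \( p_{i,j} \) denotes the probability that item \( i \) is preferred to item \( j \) in a single pairwise comparison.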

The second part focuses on preference learning without any transitivity assumption. In reality, humans rarely make consistent comparisons and often exhibit contradictory preferences, such as cycles in the preference relation. I considered the most general setting, in which no transitivity holds at all. I proposed algorithms that identify the Borda winner, an optimal choice even when a true underlying ranking does not exist, and showed that the algorithm enjoys minimal regret, a notion that captures the trade-off between exploration and exploitation. This result sheds light on the fundamental difficulty and cost of recovering human preferences under the fewest assumptions.
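For concreteness, the Borda winner is commonly defined through the Borda score; the symbols below are assumed for illustration rather than taken from the dissertation:

\[
B(i) \;=\; \frac{1}{n-1}\sum_{j \neq i} p_{i,j}, \qquad i^{*} \;=\; \arg\max_{i \in [n]} B(i),
\]

where \( p_{i,j} \) is the probability that item \( i \) is preferred to item \( j \); the Borda winner \( i^{*} \) remains well defined even when the pairwise preferences contain cycles.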

The third part also focuses on preference learning without any transitivity assumption, but considers an alternative learning objective, the von Neumann winner. I first formulate general preferences as a game environment in which two players each aim to be preferred over the other, and then present an algorithmic framework that solves this game asymptotically through self-play. The algorithm is then extended to the task of fine-tuning large language models, where it shows remarkable empirical performance.
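As a sketch of this game-theoretic objective, the standard definition of the von Neumann winner can be written as follows, with the symbols assumed here for illustration:

\[
\pi^{*} \;\in\; \arg\max_{\pi}\, \min_{\pi'}\; \Pr_{y \sim \pi,\; y' \sim \pi'}\bigl[\, y \succ y' \,\bigr],
\]

so that \( \pi^{*} \) is a (possibly mixed) policy preferred to any other policy with probability at least \( \tfrac{1}{2} \); by the minimax theorem, such a winner exists even without any transitivity assumption.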

The methods and techniques discussed in this dissertation cover a full spectrum of assumptions and settings in preference learning. In each setting, new algorithms are presented along with theoretical analyses that establish tight performance guarantees. In addition, the exploration of these settings identifies new research directions and open questions that may help advance future research on the sample efficiency of preference learning.
