Peer loss [1] is a recently proposed family of loss functions for learning with noisy labels. It is claimed to handle a wide range of label noise in binary classification tasks without explicitly estimating the noise rates, and numerical experiments demonstrate its effectiveness. However, its extension to multi-class classification remains unclear, and its working mechanism is not fully understood.
In this thesis, we study the theory of peer loss from three distinct perspectives. Following the original method in [1], we first consider the multi-class extension of peer loss and investigate its noise-tolerance properties. From this perspective, peer loss appears as a class of loss functions inspired by the truthful and proper scoring rules in the peer prediction literature. This perspective, however, turns out to be a static one and cannot satisfactorily explain how peer loss behaves during practical training. To gain an intuitive picture of the working mechanism, we further develop a divergence perspective on peer loss, expressing it as the difference between two KL divergences. From this expression, we recognize that peer loss has a built-in regularization effect that encourages the model to make confident predictions. This effect partially explains why peer loss works well under label noise, since noise often blurs the data distribution and makes the resulting model predictions uncertain. Finally, we show that peer loss suggests a new type of risk in decision theory, namely the correlation risk. This perspective helps us better understand what the model learns when trained with peer loss. To complete the discussion of the correlation risk perspective, we develop a novel method for investigating the training dynamics of peer loss. This dynamical analysis shows that, with peer loss, the resulting model tends to capture the positive correlations in the training data.
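To make the construction concrete, the following is a minimal sketch of the basic peer loss computation from [1]: each sample's loss is offset by the loss of a randomly paired "peer" prediction evaluated on an independently drawn peer label. The function names, the `alpha` weighting, and the use of cross-entropy as the base loss are illustrative choices, not part of the thesis text.

```python
import numpy as np

def cross_entropy(probs, label):
    # Standard cross-entropy for a single predicted distribution.
    return -np.log(probs[label] + 1e-12)

def peer_loss_batch(probs, labels, alpha=1.0, rng=None):
    """Sketch of peer loss over a batch.

    probs:  (n, k) array of predicted class probabilities.
    labels: (n,) array of (possibly noisy) integer labels.
    alpha:  weight on the peer term (alpha=0 recovers the base loss).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(labels)
    # Independent random pairings: one peer sample supplies the
    # prediction, a second (independent) peer supplies the label.
    peer_x = rng.permutation(n)
    peer_y = rng.permutation(n)
    losses = []
    for i in range(n):
        base = cross_entropy(probs[i], labels[i])
        peer = cross_entropy(probs[peer_x[i]], labels[peer_y[i]])
        losses.append(base - alpha * peer)
    return float(np.mean(losses))
```

Subtracting the peer term penalizes predictions that merely match the marginal label distribution, which is the source of the confidence-encouraging regularization effect discussed above.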
In addition to the theoretical analysis, we carry out extensive numerical experiments. Experiments on benchmark image datasets demonstrate the effectiveness of peer loss on multi-class classification tasks under a wide range of label noise. Our experiments on a two-dimensional synthetic dataset reveal that models trained with peer loss tend to produce hard decision boundaries. This phenomenon accords with our theoretical finding that peer loss encourages the model to make confident predictions.