This thesis focuses on developing novel techniques to improve the performance and computational efficiency of Extreme Multi-label Text Classification (XMC). XMC involves learning a classifier that assigns the most relevant subset of labels to an input from a vast label space, often on the order of millions of labels. Over the past decade, XMC has found relevance in numerous large-scale applications, such as query-to-ad-phrase matching in search ads, title-based product recommendation, and prediction of related searches.
XMC poses two major concerns, corresponding to the two forms the input can take: one addresses inputs in a short-text query format, the other inputs in a long-text document format. Since the field's inception, the document-format setting has seen consistent performance gains from scaling compute, while the query-format setting, owing to its online serving constraints, has been limited to computationally efficient smaller models. This thesis addresses both the short-text and long-text sides of XMC. More specifically, it not only provides computationally efficient solutions to the symmetric problem setting, where both input instances and label features are short texts, but also challenges the state of the art in the asymmetric setting, where input instances are long texts, using only a single GPU.
The first chapter proposes Gandalf, a novel approach that leverages a label co-occurrence graph to use label features as additional data points that supplement the training distribution. By exploiting the characteristics of short-text XMC, Gandalf constructs valid training instances from the label features and uses the label graph to generate corresponding soft-label targets, effectively capturing label-label correlations. Models trained on these new instances, although they amount to less than half the size of the original dataset, can outperform models trained on the original dataset, particularly on the PSP@k metric for tail labels. Gandalf can be applied in a plug-and-play manner to various state-of-the-art algorithms, yielding an average 5\% relative improvement across four benchmark datasets with up to 1.3M labels.
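The core construction can be illustrated with a small sketch. This is not Gandalf's actual implementation, only a minimal toy example of the idea: count label co-occurrences from the instance-label matrix and row-normalise them into soft-label targets for training instances built from label features (all variable names are illustrative).

```python
import numpy as np

# Toy instance-label matrix: 4 training instances, 3 labels.
Y = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]], dtype=float)

# Label co-occurrence counts: C[i, j] = number of instances
# tagged with both label i and label j.
C = Y.T @ Y

# Row-normalise to obtain soft-label targets: the new training
# instance built from label i's feature text is supervised with
# the distribution soft_targets[i] over all labels.
soft_targets = C / C.sum(axis=1, keepdims=True)
```

Each row of `soft_targets` sums to one, so a label that frequently co-occurs with label `i` receives proportionally more probability mass in the target for `i`'s synthetic instance.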
The second chapter addresses the computational cost of conventional XMC models, which often require up to 640GB of VRAM to train on the largest public dataset. This high cost is a consequence of calculating the loss over the entire label space. The chapter proposes UniDEC, a loss-independent, end-to-end trainable framework that trains the dual encoder (DE) and classifier together in a unified manner with a multi-class loss, reducing the computational cost by 4-16$\times$. UniDEC employs a pick-some-label (PSL) reduction to compute the loss on only a subset of carefully chosen positive and negative labels in-batch, maximizing their supervisory signals. The proposed framework achieves state-of-the-art results on datasets with millions of labels while being computationally efficient, requiring only a single GPU. UniDEC's state-of-the-art performance and computational efficiency have led to its successful deployment in Microsoft Bing's query-to-ads business, serving over 500 million advertisements in over 100 countries across the EMEA, APAC, and Latin America regions from September 2023 to March 2024.
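The pick-some-label idea can be sketched as follows. This is a simplified illustration, not UniDEC's implementation: for one instance, the softmax cross-entropy is normalised over only the positive label plus a few sampled in-batch negatives, rather than over the full (possibly million-sized) label space (the function name and arguments are illustrative).

```python
import numpy as np

def psl_loss(scores, pos, negs):
    """Multi-class loss restricted to a sampled label subset."""
    # Restrict the softmax to the positive label plus sampled
    # negatives instead of normalising over every label.
    subset = np.concatenate(([scores[pos]], scores[negs]))
    subset = subset - subset.max()        # numerical stability
    log_z = np.log(np.exp(subset).sum())
    return -(subset[0] - log_z)           # -log p(positive | subset)

# One instance with scores over 6 labels; label 2 is a positive,
# labels 0 and 5 are sampled in-batch negatives.
scores = np.array([0.1, -1.0, 2.0, 0.3, 0.0, -0.5])
loss = psl_loss(scores, pos=2, negs=[0, 5])
```

Because the normaliser involves only the sampled subset, the memory and compute footprint per instance is independent of the total number of labels, which is what enables single-GPU training.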
In summary, this thesis introduces novel techniques to improve the performance and computational efficiency of XMC. The proposed methods, Gandalf and UniDEC, leverage label features, capture label-label correlations, and significantly reduce computational costs while maintaining or improving performance, thus advancing the state of the art in compute-efficient extreme multi-label classification.