Recommender system data presents unique challenges to the data mining,
machine learning, and algorithms communities. The high missing-data rate, in
combination with the large scale and high dimensionality that are typical of
recommender system data, requires new tools and methods for efficient data
analysis. Here, we address the challenge of evaluating similarity between two
users in a recommender system, where for each user only a small set of ratings
is available. We present a new similarity score, which we call LiRa, based on a
statistical model of user similarity, for large-scale, discrete-valued data
with many missing values. We show that this score, based on a ratio of
likelihoods, is more effective at identifying similar users than traditional
similarity scores in user-based collaborative filtering, such as the Pearson
correlation coefficient. We argue that our approach has significant potential
to improve both accuracy and scalability in collaborative filtering.
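
For reference, the traditional baseline named above, the Pearson correlation coefficient, can be sketched as follows. This is a minimal illustration, not the LiRa score and not code from this work: the function name `pearson_similarity`, the dictionary representation of a user's ratings, and the choice to return `None` when the score is undefined are all assumptions made for the example.

```python
import math


def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation between two users over their co-rated items.

    ratings_u, ratings_v: dicts mapping item id -> numeric rating
    (a hypothetical representation chosen for this sketch).
    Returns None when the coefficient is undefined: fewer than two
    co-rated items, or zero variance for either user on those items.
    """
    # Only items rated by both users can contribute to the score.
    common = sorted(set(ratings_u) & set(ratings_v))
    n = len(common)
    if n < 2:
        return None

    xs = [ratings_u[item] for item in common]
    ys = [ratings_v[item] for item in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None

    return cov / math.sqrt(var_x * var_y)


# Two users who share only three rated items (a, b, c).
alice = {"a": 5, "b": 3, "c": 1, "d": 4}
bob = {"a": 4, "b": 3, "c": 2, "e": 5}
print(pearson_similarity(alice, bob))
```

In this toy example the two users overlap on only three items, so the resulting coefficient rests on very little evidence, which is the sparse-data regime the abstract describes.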