Modern Models for Learning Large-Scale Highly Skewed Online Advertising Data
Click-through rate (CTR) and conversion rate estimation are two core prediction tasks in online advertising. However, data scientists face four major challenges when analyzing advertising data: sheer volume, as the amount of data available for mining is massive; complex structure, as there is no easy way to tell which factors drive a user to click an ad or convert, or how those factors interact with one another; high cardinality of categorical variables, as features like device ID usually have a very large number of possible values, which leads to very sparse data; and severe skewness in the response variable, as the majority of users do not click the ad. In this paper, I give a comprehensive summary of the state-of-the-art machine learning models (decision-tree-based models, regularized logistic regression, online learning, and factorization machines) that are often used in industry to address these problems. Insights and practical tricks are then provided, based on a wide range of experiments conducted on multiple data sets with different characteristics.