The Internet is rife with abuse, including spam, phishing, malicious advertising, DNS abuse, search poisoning, and click fraud. To detect, investigate, and defend against such abuse, security efforts frequently crawl large sets of Web sites that must then be classified into categories, e.g., by the attacker behind the abuse or by the type of abuse.
Domain expertise is often required at first, but manually classifying thousands or even millions of Web pages is infeasible. In this dissertation, I develop machine learning tools to help security practitioners classify Web pages at scale. These automated, data-driven methods are made possible by the efforts of miscreants to operate at scale. Crafting every scam from scratch is too expensive, so miscreants use some degree of automation and replication to recreate their attacks. As a result, underlying similarities in both the content and the structure of Web sites can link related pages together. Ultimately, this automated classification of ``big data'' collected from the Web enables large-scale measurement and informs potential defensive interventions.
This dissertation focuses on three applications. First, I present a system for monitoring Web sites that serve as online storefronts for spam-advertised goods. The system is highly accurate, even when training data is very limited. Second, I describe a system for identifying the black hat SEO campaigns that promote online stores selling counterfeit luxury goods. This system was used to nearly double the number of known campaigns to track and to increase the number of associated stores by 69%. Third, I discuss a system for categorizing the Web content hosted in new top-level domains (TLDs). In total, this system was used to classify 4.1 million domains in 480 new TLDs.
Overall, the scale of today's well-organized cybercrime demands scalable defensive analysis. This setting is where the data-driven techniques of machine learning prove especially useful. Furthermore, large-scale classification has become a frequent need in security, and the methods developed here apply more generally to problems beyond those documented in this dissertation.