UC San Diego
Learning to detect malicious URLs
- Author(s): Ma, Justin Tung; et al.
Malicious Web sites are a cornerstone of Internet criminal activities. They host a variety of unwanted content, ranging from spam-advertised products, to phishing sites, to dangerous "drive-by" exploits that infect a visitor's machine with malware. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. The most prominent existing approaches to the malicious URL problem are manually constructed blacklists, as well as client-side systems that analyze the content or behavior of a Web site as it is visited. The premise of this dissertation is that we should be able to construct a lightweight URL classification system that simultaneously overcomes the challenges facing blacklists (which rely on manual updates and can quickly become obsolete) and client-side systems (which are difficult to deploy on a large scale because of their high overhead). To this end, our contribution is a highly effective system for malicious URL detection that (in its final form) leverages large numbers of features and online learning to scalably and adaptively construct an accurate classifier. Because our system exploits large amounts of training data and adapts to day-by-day variations, we are able to classify URLs with up to 99% accuracy. In pursuing malicious URL detection, this dissertation also addresses issues that arise from the use of online learning for this application. Thus, our further contributions include advances in understanding the role of uncertainty in online learning, as well as the benefits of exploiting feature correlations in high-dimensional applications such as URL classification. Overall, the contributions of this dissertation make significant advances in improving malicious URL detection and in understanding the role of online learning in this application.
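To make the idea of online learning over many sparse URL features concrete, the following is a minimal sketch and not the dissertation's actual system. It assumes a hypothetical feature set (lowercase alphanumeric tokens of the URL string) and a plain perceptron update, neither of which is specified in the abstract. The class name, function names, and example URLs are all invented for illustration.

```python
import re


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens.

    Assumption: a simple bag-of-tokens stands in for the large
    feature set mentioned in the abstract.
    """
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class OnlinePerceptron:
    """Mistake-driven linear classifier over sparse token features.

    Labels are +1 (malicious) and -1 (benign). The perceptron is used
    here only as a simple example of an online learner.
    """

    def __init__(self):
        self.w = {}      # token -> weight; grows as new tokens appear
        self.bias = 0.0

    def score(self, url):
        # Each distinct token contributes its weight once (binary features).
        return self.bias + sum(self.w.get(t, 0.0) for t in set(url_tokens(url)))

    def predict(self, url):
        return 1 if self.score(url) > 0 else -1

    def update(self, url, label):
        """Predict first, then adjust weights only if the prediction was wrong."""
        pred = self.predict(url)
        if pred != label:
            for t in set(url_tokens(url)):
                self.w[t] = self.w.get(t, 0.0) + label
            self.bias += label
        return pred


# Hypothetical labeled stream, processed one example at a time.
stream = [
    ("http://example.com/news", -1),
    ("http://secure-login.example.test/verify/account", 1),
    ("http://example.com/about", -1),
]

clf = OnlinePerceptron()
for url, label in stream:
    clf.update(url, label)
```

Because each example is seen once and only the weights of its own tokens change, the cost per URL is proportional to the number of tokens in that URL rather than to the total number of features. That property is what lets a learner of this kind keep up with a continuous feed of newly labeled URLs and track day-by-day variation.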