Scholarly Works (4 results)

Article

Why Phishing Works

Recent Work (2006)

To build systems shielding users from fraudulent (or phishing) websites, designers need to know which attack strategies work and why. This paper provides the first empirical evidence about which malicious strategies are successful at deceiving general users. We first analyzed a large set of captured phishing attacks and developed a set of hypotheses about why these strategies might work. We then assessed these hypotheses with a usability study in which 22 participants were shown 20 web sites and asked to determine which ones were fraudulent. We found that 23% of the participants did not look at browser-based cues such as the address bar, status bar and the security indicators, leading to incorrect choices 40% of the time. We also found that some visual deception attacks can fool even the most sophisticated users. These results illustrate that standard security indicators are not effective for a substantial fraction of users, and suggest that alternative approaches are needed.

Thesis
Peer Reviewed

Taming Evasions in Machine Learning Based Detection Pipelines

UC Berkeley Electronic Theses and Dissertations (2016)

This thesis presents and evaluates three mitigation techniques for evasion attacks against machine learning based detection pipelines. Machine learning based detection pipelines provide much of the security in modern computerized systems. For instance, these pipelines are responsible for the detection of undesirable content on computing platforms and Internet-based services, such as malicious software and email spam. By its adversarial nature, the security application domain exhibits a permanent arms race between attackers who aim to avoid, or evade, detection and the pipeline's maintainers whose aim is to catch all undesirable content.

The first part of this thesis examines a defense technique for the concrete application domain of comment spam on social media. We propose content complexity, a compression-based normalized measure of textual redundancy that is mostly insensitive to the underlying language used and adversarial word spelling variations. We demonstrate on a real dataset of tens of millions of comments that content complexity alone achieves 15 percentage points higher precision than a state-of-the-art detection system.
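The abstract does not give the exact definition of content complexity, so the following is only a minimal sketch of the general idea of a compression-based redundancy measure. It assumes a plain compression ratio computed with `zlib`; the function name `compression_ratio` and the sample strings are hypothetical and are not taken from the thesis.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size divided by raw size, both in bytes.

    Hypothetical helper: a generic compression ratio, not the thesis's
    own definition of content complexity. Lower values indicate more
    redundant text.
    """
    raw = text.encode("utf-8")
    if not raw:
        # Empty input carries no content; define its ratio as 0.0.
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


# Illustrative inputs (made up for this sketch).
repetitive = "buy cheap pills now! " * 50
varied = (
    "Thanks for the detailed write-up. I tried the second approach on my "
    "own laptop yesterday and it fixed the crash, although startup is a "
    "little slower than before. Has anyone measured memory usage?"
)

if __name__ == "__main__":
    print("repetitive:", round(compression_ratio(repetitive), 3))
    print("varied:    ", round(compression_ratio(varied), 3))
```

Because the ratio works on bytes rather than words, it needs no tokenizer or dictionary, which is one plausible reason such a measure would be largely insensitive to language and to spelling variations.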

The second part of this thesis takes a quantitative approach to evasion and introduces one machine learning algorithm and one learning framework for building hardened detection pipelines. Both techniques are generic and suitable for a large class of application domains. We propose the convex polytope machine, a non-linear large-scale learning algorithm which aims at finding a large-margin polytope separator, thereby decreasing the effectiveness of evasion attacks. We show that as a general purpose machine learning algorithm, the convex polytope machine displays an outstanding trade-off between classification accuracy and computational efficiency. We also demonstrate on a benchmark handwritten digit recognition task that the convex polytope machine is quantitatively as evasion-resistant as a classic neural network.

We finally introduce adversarial boosting, a boosting-inspired framework for iteratively building ensemble classifiers that are hardened against evasion attacks. Adversarial boosting operates by repeatedly constructing evasion attacks and adding the corresponding corrective sub-classifiers to the ensemble. We implement this technique for decision tree sub-classifiers by constructing the first exact and approximate automatic evasion algorithms for tree ensembles. For our benchmark task, the adversarially boosted tree ensemble is respectively five times and two times less evasion-susceptible than regular tree ensembles and the convex polytope machine.

Thesis
Peer Reviewed

Scalable Platform for Malicious Content Detection Integrating Machine Learning and Manual Review

UC Berkeley Electronic Theses and Dissertations (2015)

This thesis examines the design, implementation and performance of a scalable analysis platform for the detection of malicious content. To reflect the deployment of actual production systems, we design our platform to explicitly model the passage of time and the involvement of human supervisors in the analysis process. This thesis shows how our platform can operate efficiently at a large scale. The thesis presents and evaluates our platform in the context of a case study focused on malware detection.

To model the passage of time while still allowing for batch training methods, our platform discretizes time into a series of retraining periods, allowing updated samples and labels to emerge during each period. During each retraining period, our platform combines the presently deployed model with externally available information about newly emerged samples to select samples for submission to a human labeling oracle. To support a large volume of data over successive timeframes, our platform uses advanced techniques to manage the size of data, including compression and selective data retention. These operations support efficient feature extraction.
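The retraining-period workflow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the platform's implementation: the abstract does not say how samples are chosen for review, so the sketch assumes a simple uncertainty rule (scores closest to 0.5 go to the human first), and the names `bucket_by_period` and `select_for_review` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# (arrival time in days since the start of collection, sample id)
Sample = Tuple[float, str]


def bucket_by_period(samples: Iterable[Sample], period_days: float) -> Dict[int, List[str]]:
    """Discretize arrival times into consecutive retraining periods."""
    periods: Dict[int, List[str]] = defaultdict(list)
    for arrival, sample_id in samples:
        periods[int(arrival // period_days)].append(sample_id)
    # Return periods in chronological order.
    return dict(sorted(periods.items()))


def select_for_review(
    sample_ids: Iterable[str],
    score: Callable[[str], float],
    budget: int,
) -> List[str]:
    """Pick up to `budget` samples for the human labeling oracle.

    Assumed rule (not from the source): prefer samples whose score from
    the currently deployed model is closest to 0.5, i.e. most uncertain.
    """
    ranked = sorted(sample_ids, key=lambda sid: abs(score(sid) - 0.5))
    return ranked[:budget]


if __name__ == "__main__":
    arrivals = [(0.5, "a"), (3.0, "b"), (7.2, "c"), (13.9, "d"), (14.0, "e")]
    # Stand-in for the deployed model's maliciousness scores.
    model_scores = {"a": 0.9, "b": 0.55, "c": 0.2, "d": 0.5, "e": 0.7}
    for period, ids in bucket_by_period(arrivals, period_days=7).items():
        chosen = select_for_review(ids, model_scores.__getitem__, budget=1)
        print(f"period {period}: new samples {ids}, sent to review {chosen}")
```

In a fuller loop, the labels returned by the oracle would be added to the training set and the model retrained before the next period begins.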

Our platform is implemented in Python, allowing use of both the Python scientific stack (NumPy, SciPy, scikit-learn) and IPython for interactive, distributed computation. In the interest of scalability, our system uses HDFS and Apache Spark to manage distributed data and computation. This thesis discusses our implementation as well as the hardware and software configuration supporting our system.

This thesis presents an evaluation of our work using a malware dataset containing over 1 million samples collected over a period of 2.5 years. It begins by characterizing our dataset, including an examination of label shift over time motivating our work. It presents evidence demonstrating that by submitting a small fraction of samples for human review we are able to appreciably increase detection outcomes.

We have released our code along with 3% of our case study data, allowing replication of our results on a single node. Note that detection performance will vary due to the decrease in available training data.

Article
Peer Reviewed

The security of machine learning

UC Berkeley Previously Published Works (2010)

Machine learning’s ability to rapidly evolve to changing and complex situations has helped it become a fundamental tool for computer security. That adaptability is also a vulnerability: attackers can exploit machine learning systems. We present a taxonomy identifying and analyzing attacks against machine learning systems. We show how these classes influence the costs for the attacker and defender, and we give a formal structure defining their interaction. We use our framework to survey and analyze the literature of attacks against machine learning systems. We also illustrate our taxonomy by showing how it can guide attacks against SpamBayes, a popular statistical spam filter. Finally, we discuss how our taxonomy suggests new lines of defenses.