The goal of this thesis is to systematically extract information from security forums,
whose content is generally unstructured: the text of a post
does not necessarily follow any writing rules. By contrast, many security initiatives and
commercial entities harness readily available public information, but they seem to focus
on structured sources of information. Here, we focus on analyzing text content in security
forums to extract actionable information. Specifically, we find IP addresses
reported in the text, study keyword-based queries, and identify and classify threads that
are of interest to security analysts.
The power of our study lies in the following key novelties. First, we use a matrix
decomposition method to extract latent features of the user behavioral information,
which we combine with textual information from related posts. Second, we address the
labeling difficulties by utilizing a cross-forum learning method that helps to transfer knowledge
between models. Third, we develop a multi-step weighted embedding approach: more
specifically, we project words, threads, and classes into appropriate embedding spaces and
establish relevance and similarity there. These novel approaches enable us to extract and
refine information that could not be obtained from security forums using only trivial
analyses.
We collected a wealth of data from six different security forums. The contribution
of our work is threefold: (a) we develop a method to automatically identify malicious IP
addresses observed in the forums; (b) we propose a systematic method to identify and
classify user-specified threads of interest into four different categories; and (c) we present
an iterative approach to expand the initial keywords of interest, which are essential inputs for
searching and retrieving information.
We see our approaches as essential building blocks in developing useful methods
for harnessing the wealth of information available in online forums.