In this paper, we examine an important recent rule-based information
extraction (IE) technique, Boosted Wrapper Induction (BWI), by conducting
experiments on a wider variety of tasks than previously studied, including
tasks using several collections of natural text documents. We provide a
systematic analysis of how each algorithmic component of BWI, in particular
boosting, contributes to its success. We show that the benefit of boosting
arises from the ability to reweight examples to learn specific rules (resulting
in high precision) combined with the ability to continue learning rules after
all positive examples have been covered (resulting in high recall). As a
quantitative indicator of the regularity of an extraction task, we propose a
new measure that we call the SWI ratio. We show that this measure is a good
predictor of IE success. Based on these results, we analyze the strengths and
limitations of current rule-based IE methods in general. Specifically, we
explain limitations in the information made available to these methods, and in
the representations they use. We also discuss how confidence values returned
during extraction are not true probabilities. In this analysis, we investigate
the benefits of including grammatical and semantic information for natural text
documents, as well as parse tree and attribute-value information for XML and
HTML documents. We show experimentally that incorporating even limited
grammatical information can improve the regularity of, and hence performance on,
natural text extraction tasks. We conclude with proposals for enriching the
representational power of rule-based IE methods to exploit these and other
types of regularities.
Pre-2018 CSE ID: CS2002-0696