Domingos, Pedro

A unified approach to concept learning

1997

Domingos, Pedro

Abstract

This dissertation proposes a unification of two leading approaches to concept learning: rule induction and instance-based learning.

Current rule induction algorithms based on the "separate and conquer" paradigm suffer from the fragmentation of the training set produced as induction progresses, and from high error rates in rules covering few examples (the "small disjuncts problem"). Current instance-based learners are unable to select different attributes in different regions of the instance space. The limitations of either approach can be addressed by bringing in elements of the other.

In this dissertation, the two paradigms are unified by noting the relationship between the representations they use, and introducing a new algorithm to learn concept descriptions in the unified representation. Instances and rules are unified syntactically by viewing instances as maximally specific rules, and semantically by allowing rules to match examples approximately. The RISE algorithm learns rules by gradually generalizing instances until no improvement in accuracy is obtained. Theoretical analysis shows this approach to be efficient and asymptotically optimal. An extensive empirical study using benchmark datasets shows that RISE consistently succeeds in improving on the predictive accuracy of its parent paradigms, and also on the accuracy of state-of-the-art decision tree learners. Lesion studies verify that each of RISE's components is essential to its performance. Studies in carefully controlled artificial domains show that RISE's advantage relative to other rule induction algorithms is at least partly due to its ability to reduce the fragmentation and small disjuncts problems, and that its advantage relative to other instance-based learners is at least partly due to its ability to select different attributes in different regions of the instance space.

The application of RISE to large databases is made possible through the use of sampling techniques, most notably partitioning, which can sometimes simultaneously reduce running time and improve accuracy. Finally, a data mining algorithm based on RISE's "conquering without separating" strategy is introduced, and shown to have linear worst-case running time in all the relevant parameters, while achieving accuracies at the level ofmore expensive state-of-the art systems, producing much simpler output, and being highly robust with respect to noise.

Main Content

For improved accessibility of PDF content, download the file to your device.

Donald Bren School of Information and Computer Sciences

A unified approach to concept learning