- Main

## Use and analysis of new optimization techniques for decision theory and data mining

- Author(s): Moreno Centeno, Erick
- Advisor(s): Hochbaum, Dorit S
- et al.

## Abstract

This dissertation addresses important problems in decision theory and

data mining. In particular, we focus on problems of the form: Each of

several information sources provides evaluations or measurements of

the objects in a universal set, and the objective is to aggregate

these, possibly conflicting, evaluations into a *consensus*

*evaluation* of each object in the universal set. In addition, we

concentrate on the scenario where each source provides evaluations of

only a strict subset of the objects; that is, each source provides an

*incomplete evaluation*.

In order to define the consensus evaluation from a given set of

incomplete evaluations, two distances are developed: the first is a

distance between incomplete rankings (ordinal evaluations) and the

second is a distance between incomplete ratings (cardinal

evaluations). These two distances generalize Kemeny and Snell’s distance between complete rankings and Cook and Kress’ distance between complete ratings, respectively. Specifically, we introduce a set of natural axioms that must be satisfied by a distance between two incomplete rankings (ratings) and prove the uniqueness and existence of a distance satisfying such axioms. Given a set of incomplete rankings (ratings), the consensus ranking (rating) is

defined as the complete ranking (rating) that minimizes the sum of

distances to each of the given rankings (ratings). We provide several

examples that show that the consensus ranking (rating) obtained by

this approach is more intuitive than that obtained by other approaches.

Finding the consensus ranking is NP-hard, thus we develop two

optimization methodologies to find the consensus ranking: one

efficient approximation algorithm based on the separation-deviation

model and one exact algorithm based on the implicit hitting set

approach. In addition, we show that the optimization problem that

needs to be solved in order to find the consensus rating is a special

case of the separation-deviation model (hereafter SD model), which is

solvable in polynomial time. In this sense, the herein developed

theory (described in the previous paragraph) can be thought of an

axiomatization of the SD model.

Three applications of the SD model are presented: rating the

credit-risk of countries; customer segmentation; and ranking the

participants in a student paper competition. In the credit-risk

rating study, it is shown that the SD model leads to an improved

aggregate rating with respect to several criteria. We compare the SD

model with other aggregation methods and show the following: Although

the SD model is a method to aggregate cardinal evaluations, the

aggregate credit-risk ratings obtained by the SD model are also good

with respect to “ordinal criteria”. Several properties of the SD model are proven, including the property that the aggregate rating

obtained by the SD model agrees with the majority of agencies or

reviewers, regardless of the scale used.

The customer segmentation study shows how to use the SD model to

process data on customer purchasing timing. The outcome of the SD

model provides insights on the rate of new product adoption by the

company’s consumers. In particular, the SD model is used as follows: given the purchase dates for each customer of several products, this information is aggregated in order to rate the customers with regard

to their promptness to adopt new technology. We show that this

approach outperforms unidimensional scaling—a widely used data mining methodology. We analyze the results with respect to various

dimensions of the customer base and report on the generated insights.

The last presented application illustrates our aggregation methods in

the context of the 2007 MSOM’s student paper competition. The

aggregation problem in this competition poses two challenges. First,

each paper was reviewed only by a very small fraction of the judges;

thus the aggregate evaluation is highly sensitive to the subjective

scales chosen by the judges. Second, the judges provided both

cardinal and ordinal evaluations (ratings and rankings) of the papers

they reviewed. This chapter develops the first known methodology to

simultaneously aggregate ordinal and cardinal evaluations into a

consensus evaluation.

Although the content of this dissertation is framed in terms of

decision theory, Hochbaum showed that data mining problems can be

viewed as special cases of decision theory problems. In particular,

the customer segmentation study is a classic data mining problem.