# Your search: "author:Hochbaum, Dorit S"

## Scholarly Works (14 results)

Integer programming formulations play a key role in the design of efficient algorithms and approximation algorithms for many discrete linear optimization problems. A specific integer programming formulation of a discrete linear optimization problem may lead to algorithms with better running times, or approximation algorithms with better approximation factors, than other algorithms for the same problem. Inspired by this, we introduce new integer programming formulations to devise efficient algorithms and approximation algorithms for two classes of problems: the covariate balancing problems and the Replenishment Storage problems.

The existence of certain special integer programming formulations is evidence of the existence of polynomial-time algorithms for some discrete optimization problems. For a variant of the covariate balancing problems, our new integer programming formulation has a network structure that was not previously known. We use this formulation to devise network flow algorithms with better running times than the existing polynomial-time algorithm. Other integer programming formulations we introduce show that several variants of the class of covariate balancing problems are fixed-parameter tractable. Integer programming formulations are also important in designing approximation algorithms for intractable problems. The Replenishment Storage problems were known to be NP-hard, and one approximation algorithm was known for a special variant of the class. Using a new integer programming formulation, we derive a polynomial-time approximation scheme for this variant. The new formulation also leads to the first known approximation scheme for another variant of this class, and it resolves the complexity status of some variants of the Replenishment Storage problems as weakly NP-hard.

Markov random field (MRF) is a multi-label clustering model with applications in image segmentation, image deblurring, isotonic regression, graph fused lasso, and semi-supervised learning. It is a convex optimization problem on an arbitrary graph whose objective function consists of functions defined on the nodes (deviation functions) and on the edges (separation functions) of the graph. There exists a provably fastest MRF algorithm for arbitrary graphs and a broad class of objective functions; it relies on the graph minimum-cut technique. Although MRF with this class of objective functions generalizes isotonic regression and graph fused lasso, this MRF algorithm has lower time complexity than the specialized algorithms for those problems.

Some problems in time series and gene sequence signal processing are special cases of MRF on a path graph. We present three efficient algorithms that solve MRF on a path graph for different classes of objective functions; each is the fastest known algorithm for its respective class. The first algorithm uses the graph minimum-cut technique and is inspired by the provably fastest MRF algorithm for arbitrary graphs. The second is based on a relationship with a lot-sizing problem in production planning. The third is based on the Karush-Kuhn-Tucker (KKT) optimality conditions.
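
To make the path-graph MRF objective concrete (node deviation terms plus edge separation terms), the following sketch minimizes it by dynamic programming over a small discrete label set. This is not one of the abstract's algorithms, only a minimal illustration; the observations, label set, and penalty weight are invented.

```python
import numpy as np

def mrf_path_dp(observations, labels, lam):
    """Minimize sum_i (x_i - y_i)^2 + lam * sum_i |x_i - x_{i+1}|
    over assignments x_i drawn from the discrete set `labels`."""
    n, k = len(observations), len(labels)
    lab = np.array(labels, dtype=float)
    # D[v] = best cost of the prefix ending with label index v.
    D = (lab - observations[0]) ** 2
    back = []
    for i in range(1, n):
        newD = np.empty(k)
        choice = np.empty(k, dtype=int)
        for v in range(k):
            # Cost of each predecessor label plus the separation penalty.
            trans = D + lam * np.abs(lab - lab[v])
            u = int(np.argmin(trans))
            choice[v] = u
            newD[v] = (lab[v] - observations[i]) ** 2 + trans[u]
        back.append(choice)
        D = newD
    # Backtrack to recover the optimal labeling.
    v = int(np.argmin(D))
    xs = [labels[v]]
    for choice in reversed(back):
        v = int(choice[v])
        xs.append(labels[v])
    return list(reversed(xs)), float(D.min())

# A noisy two-plateau signal with two candidate labels.
xs, cost = mrf_path_dp([0.0, 0.1, 1.0, 1.1], labels=[0.0, 1.0], lam=0.5)
```

With the absolute-value separation term, the recovered labeling snaps to the two plateaus rather than tracking each noisy observation.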

MRF is used in image segmentation to identify multiple salient features in an image. The Hochbaum Normalized Cut (HNC) model is a binary clustering model that is also applicable to image segmentation, with the goal of separating the foreground from the background. We compare the empirical performance of the HNC model and the spectral method in image segmentation. Both HNC and the spectral method can be viewed as heuristics for the NP-hard normalized cut criterion, another binary clustering criterion often used for image segmentation. We present experimental evidence that the HNC model provides solutions which not only better approximate the optimal normalized cut solution, but also have better visual segmentation quality than those provided by the spectral method. The evidence further suggests that HNC should be preferred over normalized cut as an image segmentation criterion.
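
As a hedged sketch of the spectral method mentioned above: the classic spectral heuristic for normalized cut takes the eigenvector of the second-smallest eigenvalue of the normalized Laplacian and thresholds it to obtain a binary partition. The toy weight matrix below (two small cliques joined by one weak edge) is an invented example, not data from the study.

```python
import numpy as np

def spectral_bipartition(W):
    """Threshold the second eigenvector of the normalized Laplacian."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]                 # 2nd-smallest eigenvalue's vector
    return fiedler >= 0                  # boolean cluster labels

# Two 3-node cliques joined by a single weak edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
labels = spectral_bipartition(W)
```

On this graph the sign of the second eigenvector separates the two cliques, which is exactly the behavior the spectral heuristic relies on.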

The main focus of this thesis is the use of machine learning and data mining techniques to solve challenging problems. Three problems from different subject areas are discussed: nuclear material detection, drug ranking, and target tracking in video sequences. The techniques for all three problems are based on an efficiently solvable variant of normalized cut, Normalized Cut Prime (NC').

The first problem concerns detecting concealed illicit nuclear material, an important part of strategies for preventing and deterring nuclear terrorism. What makes this an extremely difficult task are the physical limitations of nuclear radiation detectors (arising from their energy resolution and efficiency) and the shielding materials terrorists would presumably use to surround the radioactive nuclear material and absorb some of the radiation, thereby reducing the strength of the detected signal. This means the central data analysis problem is identifying a potentially very weak signal and distinguishing it from both background noise arising from the detector characteristics and naturally occurring environmental radiation. We aim to enhance detection capabilities with algorithmic methods specifically tailored to nuclear data. A novel graph-theory-based methodology based on NC', called Supervised Normalized Cut (SNC), is used to classify measurements obtained from very low resolution plastic scintillation detectors. The accompanying computational study, comparing the SNC method with several alternative classification methods, shows that in terms of accuracy the SNC method is on par with the alternatives, yet SNC is computationally more efficient.

The second subject area is drug ranking. This problem refers to placing several drugs treating the same disease in rank order according to their effectiveness, using data derived from cell images. Current technologies use recently developed high-throughput drug profiling (high content screening, or HCS). Despite the potential of HCS for accurate descriptions of drug profiles, it produces a deluge of quantitative, multidimensional data, posing analytical challenges in the data mining process. Our new framework, the fractional adjusted bi-partitional score (FABS), is designed to alleviate these difficulties by producing graph-theoretic descriptors and automatically ordering the performance of drugs; it is a way of converting classifications into scores. We experimented with the FABS framework by implementing different algorithms and assessing the accuracy of the results in a comparative study that includes four other baseline methods. The conclusion is encouraging: FABS implemented with NC' consistently outperforms other implementations of FABS as well as alternative ranking methods unrelated to FABS.

The third problem is target tracking in video sequences. It can be framed as an unsupervised learning problem: the goal is to delineate a target of interest in a video from the background. The tracking task is cast as a graph cut that incorporates intensity and motion data into the formulation. Tests on real-life benchmark videos show that the developed technique, NC-track, based on NC', is more efficient than many existing techniques and delivers good-quality results.

Similarity-based machine learning methods differ from traditional machine learning methods in that they also use pairwise similarity relations between objects to infer the labels of unlabeled objects. A comparative study of classification problems by Baumann et al. [2019] demonstrated that similarity-based techniques have superior performance and robustness compared to well-established machine learning techniques. Similarity-based machine learning methods benefit from two advantages that could explain their superior performance: they can make use of the pairwise relations between unlabeled objects, and they are robust due to the transitive property of pairwise similarities.

A challenge for similarity-based machine learning methods on large datasets is that the number of pairwise similarities grows quadratically in the size of the dataset. For large datasets, it thus becomes practically impossible to compute all possible pairwise similarities. In 2016, Hochbaum and Baumann proposed the technique of sparse computation to address this growth by computing only those pairwise similarities that are relevant. Their proposed implementation of sparse computation is, however, still difficult to scale to millions of objects.

This dissertation focuses on advancing the practical implementations of sparse computation to larger datasets and on two applications for which similarity-based machine learning was particularly effective. The applications that are studied here are cell identification in calcium-imaging movies and detecting aberrant linking behavior in directed networks.

For sparse computation, we present faster, geometric algorithms and a technique, named sparse-reduced computation, that combines sparse computation with compression. The geometric algorithms compute exactly the same output as the original implementation of sparse computation but identify the relevant pairwise similarities faster, using the concept of data shifting to find objects in the same or neighboring blocks. Empirical results on datasets with up to 10 million objects show a significant reduction in running time. Sparse-reduced computation combines sparse computation with a technique for compressing highly similar or identical objects, enabling the use of similarity-based machine learning on massively large datasets. The computational results demonstrate that sparse-reduced computation provides a significant reduction in running time with a minute loss in accuracy.
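
The block idea behind sparse computation can be sketched as follows, under assumed simplifications (a 1-D PCA projection and a fixed grid resolution; the dissertation's algorithms are more general): objects are binned into blocks along the projection, and only pairs in the same or neighboring blocks are kept as candidates for similarity computation, so distant pairs are never evaluated.

```python
import numpy as np

def candidate_pairs(X, resolution=4):
    """Return index pairs of objects in the same or neighboring blocks."""
    # Project onto the first principal component (a 1-D grid for simplicity).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[0]
    # Assign each object to a block on a 1-D grid over the projection.
    lo, hi = proj.min(), proj.max()
    blocks = np.minimum(((proj - lo) / (hi - lo + 1e-12) * resolution).astype(int),
                        resolution - 1)
    members = {}
    for idx, b in enumerate(blocks):
        members.setdefault(int(b), []).append(idx)
    # Enumerate pairs only inside a block and between adjacent blocks.
    pairs = set()
    for b, objs in members.items():
        for other in (b, b + 1):
            for i in objs:
                for j in members.get(other, []):
                    if i != j:
                        pairs.add((min(i, j), max(i, j)))
    return pairs

# Two tight, well-separated clusters of 5 objects each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 3)), rng.normal(5, 0.1, (5, 3))])
pairs = candidate_pairs(X)
```

Here the two clusters land in non-adjacent blocks, so only the 20 within-cluster pairs survive out of all 45 possible pairs.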

A major problem facing neuroscientists today is cell identification in calcium-imaging movies. These movies are in-vivo recordings of thousands of neurons at cellular resolution. There is a great need for automated approaches to extract the activity of single neurons from these movies since manual post-processing takes tens of hours per dataset. We present the HNCcorr algorithm for cell identification in calcium-imaging movies. The name HNCcorr is derived from its use of the similarity-based Hochbaum's Normalized Cut (HNC) model with pairwise similarities derived from correlation. In HNCcorr, the task of cell detection is approached as a clustering problem. HNCcorr utilizes HNC to detect cells in these movies as coherent clusters of pixels that are highly distinct from the remaining pixels. HNCcorr guarantees, unlike existing methodologies for cell identification, a globally optimal solution to the underlying optimization problem. Of independent interest is a novel method, named similarity-squared, that we devised for measuring similarity between pixels. We provide an experimental study and demonstrate that HNCcorr is a top performer on the Neurofinder cell identification benchmark and that it improves over algorithms based on matrix factorization.
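
The correlation-based similarities that give HNCcorr its name can be sketched with a simulated toy "movie": two pixels share a common activity signal while a third is independent background noise. The traces are invented, and the full HNC clustering step and the similarity-squared method are not reproduced here.

```python
import numpy as np

# Simulate activity traces for three pixels over 200 frames.
rng = np.random.default_rng(1)
signal = rng.normal(size=200)                  # shared cell activity
traces = np.vstack([
    signal + 0.1 * rng.normal(size=200),       # pixel 0: inside the cell
    signal + 0.1 * rng.normal(size=200),       # pixel 1: inside the cell
    rng.normal(size=200),                      # pixel 2: background noise
])

# Pairwise similarity = correlation between pixel traces.
similarity = np.corrcoef(traces)
```

The two in-cell pixels end up highly similar while the background pixel is nearly uncorrelated with both, which is the structure the clustering step exploits.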

The second application is detecting aberrant agents, such as fake news sources or spam websites, based on their linking behavior in networks. Across contexts, a distinguishing characteristic between normal and aberrant agents is that normal agents rarely link to aberrant ones. We refer to this phenomenon as aberrant linking behavior. We present a Markov random field (MRF) formulation, with links as the pairwise similarities, that detects aberrant agents based on aberrant linking behavior and any prior information (if given). This MRF formulation is solved optimally and in polynomial time. We compare the optimal solution of the MRF formulation to well-known algorithms based on random walks. In our empirical experiments with twenty-three different datasets, the MRF method outperforms the other detection algorithms. This work represents the first use of optimization methods for detecting aberrant agents, as well as the first time that MRF is applied to directed graphs.
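
To make the MRF objective concrete, here is a hedged toy version for aberrant-agent detection: each node pays a prior cost for its label (0 = normal, 1 = aberrant), and each link from a normal node to an aberrant node pays a penalty, reflecting that normal agents rarely link to aberrant ones. Brute force over labelings stands in for the polynomial-time optimal solution; the graph, priors, and penalty below are invented.

```python
from itertools import product

def best_labeling(n, prior_cost, edges, penalty=1.0):
    """prior_cost[i][lab] = cost of giving node i label lab (0 or 1)."""
    best, best_cost = None, float("inf")
    for labels in product([0, 1], repeat=n):
        cost = sum(prior_cost[i][labels[i]] for i in range(n))
        # Penalize every directed link from a normal node to an aberrant one.
        cost += sum(penalty for u, v in edges
                    if labels[u] == 0 and labels[v] == 1)
        if cost < best_cost:
            best, best_cost = labels, cost
    return best, best_cost

# Nodes 0 and 1 look normal, node 2 has strong aberrant prior evidence,
# node 3 is ambiguous but linked in both directions with node 2.
prior = [(0.0, 2.0), (0.0, 2.0), (3.0, 0.0), (1.0, 0.9)]
edges = [(0, 1), (1, 0), (2, 3), (3, 2)]
labels, cost = best_labeling(4, prior, edges)
```

The link structure pulls the ambiguous node 3 to the aberrant side along with node 2, since labeling it normal would incur the linking penalty.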

This dissertation addresses important problems in decision theory and data mining. In particular, we focus on problems of the form: each of several information sources provides evaluations or measurements of the objects in a universal set, and the objective is to aggregate these, possibly conflicting, evaluations into a *consensus evaluation* of each object in the universal set. In addition, we concentrate on the scenario where each source provides evaluations of only a strict subset of the objects; that is, each source provides an *incomplete evaluation*.

In order to define the consensus evaluation from a given set of incomplete evaluations, two distances are developed: the first is a distance between incomplete rankings (ordinal evaluations) and the second is a distance between incomplete ratings (cardinal evaluations). These two distances generalize Kemeny and Snell’s distance between complete rankings and Cook and Kress’ distance between complete ratings, respectively. Specifically, we introduce a set of natural axioms that must be satisfied by a distance between two incomplete rankings (ratings) and prove the uniqueness and existence of a distance satisfying such axioms. Given a set of incomplete rankings (ratings), the consensus ranking (rating) is defined as the complete ranking (rating) that minimizes the sum of distances to each of the given rankings (ratings). We provide several examples that show that the consensus ranking (rating) obtained by this approach is more intuitive than that obtained by other approaches.
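
For complete rankings, the Kemeny and Snell distance that this work generalizes counts pairwise order disagreements, and a consensus ranking minimizes the total distance to the inputs. The brute-force sketch below is exponential and serves only to make the definition concrete; the votes are invented, and the dissertation's incomplete-ranking distance and algorithms are more general.

```python
from itertools import combinations, permutations

def kemeny_distance(r1, r2):
    """r1, r2 map each object to its rank position (smaller = better);
    the distance counts object pairs ordered oppositely by r1 and r2."""
    items = list(r1)
    return sum(
        1
        for a, b in combinations(items, 2)
        if (r1[a] - r1[b]) * (r2[a] - r2[b]) < 0
    )

def consensus_ranking(rankings):
    """Brute-force consensus: the complete ranking minimizing the total
    Kemeny distance to all input rankings."""
    items = sorted(rankings[0])
    best, best_cost = None, float("inf")
    for perm in permutations(items):
        cand = {obj: pos for pos, obj in enumerate(perm)}
        total = sum(kemeny_distance(cand, r) for r in rankings)
        if total < best_cost:
            best, best_cost = perm, total
    return best, best_cost

# Three complete rankings of objects a, b, c.
votes = [
    {"a": 0, "b": 1, "c": 2},
    {"a": 0, "b": 1, "c": 2},
    {"b": 0, "a": 1, "c": 2},
]
best, cost = consensus_ranking(votes)
```

Two of the three voters prefer a over b, so the consensus keeps that order and disagrees with only one pairwise vote in total.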

Finding the consensus ranking is NP-hard, so we develop two optimization methodologies to find it: an efficient approximation algorithm based on the separation-deviation model and an exact algorithm based on the implicit hitting set approach. In addition, we show that the optimization problem that needs to be solved in order to find the consensus rating is a special case of the separation-deviation model (hereafter SD model), which is solvable in polynomial time. In this sense, the theory developed herein (described in the previous paragraph) can be thought of as an axiomatization of the SD model.

Three applications of the SD model are presented: rating the credit-risk of countries; customer segmentation; and ranking the participants in a student paper competition. In the credit-risk rating study, it is shown that the SD model leads to an improved aggregate rating with respect to several criteria. We compare the SD model with other aggregation methods and show the following: Although the SD model is a method to aggregate cardinal evaluations, the aggregate credit-risk ratings obtained by the SD model are also good with respect to “ordinal criteria”. Several properties of the SD model are proven, including the property that the aggregate rating obtained by the SD model agrees with the majority of agencies or reviewers, regardless of the scale used.

The customer segmentation study shows how to use the SD model to process data on customer purchase timing. The outcome of the SD model provides insights on the rate of new-product adoption by the company’s consumers. In particular, the SD model is used as follows: given the purchase dates of several products for each customer, this information is aggregated in order to rate the customers with regard to their promptness to adopt new technology. We show that this approach outperforms unidimensional scaling, a widely used data mining methodology. We analyze the results with respect to various dimensions of the customer base and report on the generated insights.

The last presented application illustrates our aggregation methods in the context of the 2007 MSOM student paper competition. The aggregation problem in this competition poses two challenges. First, each paper was reviewed by only a very small fraction of the judges; thus the aggregate evaluation is highly sensitive to the subjective scales chosen by the judges. Second, the judges provided both cardinal and ordinal evaluations (ratings and rankings) of the papers they reviewed. This chapter develops the first known methodology to simultaneously aggregate ordinal and cardinal evaluations into a consensus evaluation.

Although the content of this dissertation is framed in terms of decision theory, Hochbaum showed that data mining problems can be viewed as special cases of decision theory problems. In particular, the customer segmentation study is a classic data mining problem.

This dissertation explores the use of geometric and graphical models for a variety of information search and filtering applications. These models provide an intuitive understanding of the problem domains as well as computational efficiencies for our solution approaches.

We begin by considering a search and rescue scenario where both human and automated agents share control over a fleet of unmanned aerial vehicles (UAVs) with the goal of locating a missing subject as quickly as possible. We describe a new interface and search framework, Hydra, which merges the intuition, reasoning, and vision capabilities of humans with the computational power of machines to reduce the expected time to locate the subject. The interface allows participating human agents to collaboratively decide where to send the UAVs via spatial dynamic voting, a geometric method for aggregating regional selections (votes) on a map. Via extensive simulation and theoretical analysis, we show that our method can be an effective component of search and rescue operations.

In the next chapter, we present a new graph-theoretical model for filtering a large set of genes to identify those that exhibit the most significant change in expression values between a series of control and test experiments; this is known as the Gene Selection Problem. Although not a geometric model in the traditional sense, graph theory allows us to organize data in abstract geometric spaces, where similarity metrics define relative distances between nodes of data, as opposed to working with an absolute coordinate system. Our algorithm first pre-processes the data using statistical hypothesis testing to filter out statistically irrelevant genes, and then analyzes the expression levels recorded for each gene by modeling them on a graph and evaluating the capacity of the cut between the test and control experiments. The capacity of a cut on a graph is a measure of the separation between two disjoint sets of nodes, and we use this value to rank the genes. We evaluated our model on a rich data set assessing the success of embryo implantation in mice in the presence or absence of uterine dendritic cells. A thorough biological analysis of our results enabled the discovery of significant factors that were not identified by more traditional statistical methods.
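
The cut-capacity score can be sketched for a single gene under assumed simplifications: weight each control/test pair of expression values with a Gaussian similarity, and take the sum over all cross pairs as the capacity of the cut separating the two groups, so a smaller capacity indicates better-separated groups. The similarity function, bandwidth, and expression values here are illustrative assumptions, not the model or data from the chapter.

```python
import numpy as np

def cut_capacity(control, test, bandwidth=1.0):
    """Total similarity across the control/test partition for one gene."""
    control = np.asarray(control, dtype=float)
    test = np.asarray(test, dtype=float)
    diffs = control[:, None] - test[None, :]          # all cross pairs
    return float(np.exp(-(diffs ** 2) / bandwidth ** 2).sum())

# A gene whose expression shifts strongly between conditions should have a
# much smaller cut capacity than one that barely changes.
changed = cut_capacity([1.0, 1.1, 0.9], [5.0, 5.2, 4.9])
stable = cut_capacity([1.0, 1.1, 0.9], [1.0, 1.2, 0.8])
```

Ranking genes by this capacity (ascending) would then surface the genes whose control and test measurements are most separated.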

In the remaining chapters of this dissertation, we transition to a series of algorithms and models for filtering information in a collaborative, social context. We begin by presenting a new, constant-time recommender system for jokes that adapts in real-time to changes in user preferences or mood. We also present an extension of this system that makes personalized recommendations on where participants might wish to donate their money.

Chapters 5 and 6 consider the domain of collaborative opinion and idea sharing in an online setting. We present a new tool, Opinion Space, that we are developing for visualizing and crowdsourcing a diversity of insights collected via textual responses to a discussion question. Opinion Space projects participants onto a two-dimensional plane using Principal Component Analysis based on their levels of agreement with a series of statements. The projection is specifically designed so that participants with similar opinions will be near each other in the space; this allows participants to easily navigate the diversity of opinions shared by others.
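
The projection step described above can be sketched with plain PCA via SVD of the centered agreement matrix; the participants, statements, and agreement values below are invented, not Opinion Space data.

```python
import numpy as np

def project_2d(A):
    """Project rows of A onto their top two principal components."""
    Ac = A - A.mean(axis=0)              # center each statement's ratings
    U, S, Vt = np.linalg.svd(Ac, full_matrices=False)
    return Ac @ Vt[:2].T                 # coordinates in the 2-D plane

# 4 participants x 3 statements, agreement on a -1..1 scale; participants
# 0-1 agree with each other and oppose participants 2-3.
A = np.array([
    [ 1.0,  1.0, -1.0],
    [ 0.9,  0.8, -0.9],
    [-1.0, -0.9,  1.0],
    [-0.8, -1.0,  0.9],
])
coords = project_2d(A)
```

Participants with similar opinion vectors land near each other in the plane, which is the property the interface relies on for navigation.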

Over the last two years, we have released multiple versions of Opinion Space and collected several rich data sets for analysis. In Chapter 5 we describe the interface and the design decisions made when building the site. We also present results from a controlled user study comparing user engagement with Opinion Space versus more traditional models of online opinion sharing (specifically, linear comment lists). Not only were participants significantly more engaged with Opinion Space, but they also had significantly higher levels of agreement with and respect for the responses that they read.

In Chapter 6 we present several models, both geometric and statistical, for ranking the contributions of our participants based on how insightful they are. Our primary model considers the spatial relationships between users in addition to the ratings they give each other; the intuition behind the model can be described as follows. By giving users the opportunity to rate the responses they read, we allow for the very likely possibility that users will only promote their own interests and rate opposing opinions poorly, even if they are well-written and pointed responses. We claim that this behavior is of little value toward our objective of identifying insightful ideas, because users are simply reinforcing their own opinions. Visually, one can imagine that the space of users is partitioned into subgroups or smaller spheres of agreement, and we are interested in emphasizing the comments where these spheres intersect. In this scenario, we have identified users of different viewpoints who have potentially found a legitimate middle ground.

Chapter 7 provides concluding remarks on our work with Opinion Space from a New Media and social responsibility perspective, and we present preliminary results on future work in the area.