Scholarly Works (18 results)

Article
Peer Reviewed

Self bounding learning algorithms

Freund, Yoav

UC San Diego Previously Published Works (1998)

Article
Peer Reviewed

Active learning for visual object detection

Technical Reports (2006)

One of the most labor-intensive aspects of developing accurate visual object detectors using machine learning is gathering a sufficient number of labeled examples. We develop a selective sampling method, based on boosting, which dramatically reduces the amount of human labor required for this task. We apply this method to the problem of detecting pedestrians from a video camera mounted on a moving car. We demonstrate how combining boosting and active learning achieves high levels of detection accuracy in complex and variable backgrounds.

Pre-2018 CSE ID: CS2006-0871

Thesis
Peer Reviewed

Online MindReader Game Utilizing Weighted Hedging Trees

Elliott, Matthew
Advisor(s): Freund, Yoav

UC San Diego Electronic Theses and Dissertations (2018)

The MindReader web app is a freely accessible online version of the “matching pennies” game. Matching pennies is an old game mentioned in the works of Edgar Allan Poe and analyzed by Claude Shannon, but our implementation is very modern. By visiting “www.mindreaderpro2.appspot.com”, a user can instantly play on their computer, tablet, or phone and can save their game results by signing into Facebook. The website is part of the larger MindReader project, which collects game data from the app and provides a data analysis platform. We present the inner workings of the project so that academic researchers can easily run and contribute to MindReader.

Article
Peer Reviewed

Using Boosting for Financial Analysis and Performance Prediction: Application to S&P 500 Companies, Latin American ADRs and Banks

UC San Diego Previously Published Works (2010)

This paper demonstrates how the boosting approach can support the financial analysis functions in two ways: (1) As a predictive tool to forecast corporate performance, and rank accounting and corporate variables according to their impact on performance, and (2) As an interpretative tool to generate alternating decision trees that capture the non-linear relationship among accounting and corporate governance variables that determine performance. We compare our results using Adaboost with logistic regression, bagging, and random forests. We conduct 10-fold cross-validation experiments on one sample each of S&P 500 companies, American Depository Receipts (ADRs) of Latin American companies and Latin American banks. Adaboost results indicate that large companies perform better than small companies, especially when these companies have a limited long-term assets to sales ratio. Performance improves for large LAADR companies when the country of residence is characterized by a weak rule of law. In the case of S&P 500 companies, performance increases when the compensation for top officers is mostly variable.

Article
Peer Reviewed

BioSpike: Efficient search for homologous proteins by indexing patterns

Technical Reports (2006)

Since high-throughput sequencing tools became available, the number of known protein sequences has been growing at an unprecedented rate. On the other hand, information about the structure or function of proteins is extremely sparse. Biologists who study proteins make extensive use of protein search engines to find homologous sequences whose structure or function is known. One well-known measure of sequence similarity is the Smith-Waterman (SW) alignment score. As calculating the SW score is computationally expensive, various approximations for finding homologous sequences have been suggested, and of these the current de facto standards for protein searching are the BLAST and PSI-BLAST methods of Altschul et al. While BLAST is an efficient approximation algorithm to the optimal SW alignment, it is still, from a computer science standpoint, a very inefficient method as it compares the query sequence to each and every sequence in the database. We present a method for indexing and searching proteins using amino acid patterns. As a source of patterns, we use the BLOCKS library of Henikoff and Henikoff. Position-specific scoring matrices (PSSMs) are used to identify pattern occurrences. Each iteration consists of a “scan”, in which we identify all statistically significant pattern occurrences in the sequence set, and a refinement stage, in which we use the identified occurrences to define better PSSMs. The final refined PSSMs are then used to index proteins in the UniProt Knowledgebase (UniProtKB), creating an efficient and accurate tool for searching protein homologues.

Pre-2018 CSE ID: CS2006-0858
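
For readers unfamiliar with the Smith-Waterman score mentioned in the abstract above, the following is a minimal sketch of the standard dynamic program for the best local alignment score. The scoring values (`match`, `mismatch`, `gap`) and the function name are illustrative assumptions; protein search in practice typically uses a substitution matrix and affine gap penalties, and nothing here is taken from the BioSpike report itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b.

    Assumed scoring: a flat match/mismatch score and a linear gap penalty.
    Runs in O(len(a) * len(b)) time and keeps only two rows in memory.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for ca in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

The quadratic cost per pair of sequences is what makes an exhaustive scan of a large database expensive, which is the motivation the abstract gives for approximate and index-based search.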

Thesis
Peer Reviewed

Playing Games to Reduce Supervision in Learning

Balsubramani, Akshay
Advisor(s): Freund, Yoav

UC San Diego Electronic Theses and Dissertations (2016)

In this dissertation, we explore two fundamental sets of inference problems arising in machine learning and statistics. We present robust, efficient, and straightforward algorithms for both, adapting sensitively to structure in data by viewing these problems as playing games against an adversary representing our uncertainty.

In the problem of classification, there is typically much more unlabeled data than labeled data, but classification algorithms are largely designed to be supervised, only taking advantage of labeled data. We explore how to aggregate the predictions of an ensemble of such classifiers as accurately as possible in a semi-supervised setting, using both types of data. The insight is to formulate the learning problem as playing a game over the unlabeled data against an adversary, who plays the unknown true labels. This formulation uses unlabeled data to improve performance over labeled data alone in an extremely general and efficient way, without model assumptions or tuning parameters. We demonstrate this by devising and evaluating a number of practical, scalable semi-supervised learning algorithms. The theoretical contributions include a proof that the optimal aggregation rules in this semi-supervised setting are artificial neurons for many natural loss functions, with efficient convex algorithms for learning them.

We also provide fundamental results for a second set of problems relating to sequential learning and testing. Random variation in such situations can typically be described by a martingale, a generalization of a random walk that describes any repeated fair game. We describe the concentration behavior of a martingale's sample path, extending to finite times the law of the iterated logarithm, a classic result of probability theory. With this powerful tool, we are able to show how to design simple sequential tests that use as few samples as possible to detect an effect, provably adapting to the unknown effect size. We also apply our results to optimally correct the p-values of many common statistical hypothesis tests, making them robust to the common practice of 'peeking' at test results and waiting for a significant one to report.

Article
Peer Reviewed

Faster Boosting with Smaller Memory

UC San Diego Previously Published Works (2019)

Article
Peer Reviewed

Random projection trees and low dimension manifolds

Technical Reports (2007)

We present a simple variant of the k-d tree which automatically adapts to intrinsic low dimensional structure in data without having to explicitly learn this structure.

Pre-2018 CSE ID: CS2007-0890
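
The abstract above gives no algorithmic detail, so the following is only a rough sketch of the general idea suggested by the title: a k-d-tree-like structure that splits along a random direction rather than a coordinate axis. The split rule (median of the projections), the dictionary node layout, and the names `build_rp_tree` and `leaves` are all assumptions made for illustration, not the report's exact construction.

```python
import math
import random


def build_rp_tree(points, min_leaf=2, rng=None):
    """Recursively split points by their projection onto a random unit vector.

    Assumption: each cell is cut at the median projection, so the two
    children are balanced. Points are equal-length tuples of floats.
    """
    rng = rng or random.Random(0)
    if len(points) <= max(min_leaf, 1):
        return {"leaf": points}
    dim = len(points[0])
    # Draw a direction uniformly at random by normalizing a Gaussian vector.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    direction = [d / norm for d in direction]
    proj = [sum(d * x for d, x in zip(direction, p)) for p in points]
    order = sorted(range(len(points)), key=proj.__getitem__)
    mid = len(order) // 2
    return {
        "direction": direction,
        "threshold": proj[order[mid]],
        "left": build_rp_tree([points[i] for i in order[:mid]], min_leaf, rng),
        "right": build_rp_tree([points[i] for i in order[mid:]], min_leaf, rng),
    }


def leaves(node):
    """Return the list of leaf cells (each a list of points)."""
    if "leaf" in node:
        return [node["leaf"]]
    return leaves(node["left"]) + leaves(node["right"])
```

Because the split direction is not tied to the coordinate axes, such a tree does not depend on how the data happen to be oriented in the ambient space, which is one intuition for why it can follow low-dimensional structure.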

Thesis
Peer Reviewed

Automated Bee Waggle Dance Detection

Bansal, Tushar
Advisor(s): Freund, Yoav

UC San Diego Electronic Theses and Dissertations (2018)

A major limitation on performing detailed behavioral analysis of honey bee colonies is that there is currently no efficient way to carry it out. Because manually analyzing the data is so time-consuming, the current approach yields small sample sizes that limit the statistical power of these analyses. An automated system can provide a breakthrough in the way this research is performed. Waggle dances are an important aspect of understanding the behavior of honey bees, as they serve as a means of communication among the bees. In this thesis, we develop an automated system using computer vision and learning techniques to solve two problems: (i) single-bee tracking and waggle detection and (ii) multiple-bee waggle detection. Our approach shows that it is possible to train learning algorithms to detect when and where a waggle happens in the hive.

Thesis
Peer Reviewed

Parallel Boosting and Learning from Diverse Datasets

Alafate, Julaiti
Advisor(s): Freund, Yoav

UC San Diego Electronic Theses and Dissertations (2020)

This thesis is a study of boosting. It consists of two parts. In the first part, we develop a new way of parallelizing boosting. In the second part, we apply boosting to the problem of bathymetry data editing and study the issues of experimental design for diverse datasets.

The first part of this thesis presents a parallel boosting algorithm that achieves a significant speedup while keeping a small memory footprint. It combines two novel techniques. One is a method for parallelization with weak synchronization requirements, which we call "Tell Me Something New" (TMSN). The other is a method we call stratified weighted sampling, which significantly reduces the I/O load of boosting.

We implemented our algorithm using the Rust programming language and demonstrated its superior performance when memory size is limited. Our experiments show a 10-100x speedup over two of the popular implementations of boosted trees, XGBoost and LightGBM, when training data is too large to fit in memory.

The second part of this thesis involves a project that uses boosting as an aid in bathymetry data editing. Bathymetry is the study of the depths and shapes of underwater terrain. The objective of our project is to create a binary classifier that separates the correct depth measurements from the incorrect ones. Our experimental results challenge the standard assumption that training and testing samples are both drawn i.i.d. from a fixed distribution.

First, we examine spurious correlation, where some training and testing samples are similar to each other because they are duplicates, near-duplicates, or sequentially collected. A simple memorization-based model could achieve a low in-sample validation error in these cases, but its out-of-sample test error is much worse.

Second, we examine data diversity, in which datasets are not diverse enough to be representative. It happens when the feature dimension is so high that collecting a representative sample is difficult. The models trained in these cases perform poorly on a new test set collected separately because of the domain shift problem.

Lastly, we propose an alternative framework from the perspective of experimental design and present a case study on modeling bathymetry data editing.
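
The leakage issue raised in this abstract (duplicates, near-duplicates, or sequentially collected samples landing on both sides of a split) can be illustrated with a generic group-aware hold-out. This is a common precaution, not the framework the thesis proposes; the function name `group_holdout_split`, the `group_of` callback, and the survey identifiers in the usage note are hypothetical.

```python
import random


def group_holdout_split(samples, group_of, test_fraction=0.25, seed=0):
    """Split samples so that no group appears in both train and test.

    group_of maps a sample to a hashable, sortable group key (for example
    the collection run it came from). Whole groups are assigned to one side,
    so related samples cannot leak across the split.
    """
    groups = sorted({group_of(s) for s in samples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [s for s in samples if group_of(s) not in test_groups]
    test = [s for s in samples if group_of(s) in test_groups]
    return train, test
```

For example, with samples stored as `(survey_id, depth)` pairs, calling `group_holdout_split(samples, group_of=lambda s: s[0])` keeps every measurement from a given survey on one side of the split, so validation error is measured on surveys the model has not seen.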
