Diehr and Faaland developed an algorithm that finds a minimum-sum-of-key-lengths pagination of a scroll of n items, using O(n lg n) time and O(n) space and solving a problem posed by McCreight. An improved algorithm is given that uses O(n) time and O(n) space.

# Your search: "author:Hirschberg, D. S."

All 14 results are peer-reviewed articles from UC Irvine's Donald Bren School of Information and Computer Sciences, classified under Physical Sciences and Mathematics.

## Scholarly Works (14 results)

An efficient algorithm is presented that solves a generalization of the Longest Common Subsequence problem, in which one of the two input strings contains sets of symbols which may be permuted. This problem arises from a music application.
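For reference, the classical (non-permuted) Longest Common Subsequence problem that this result generalizes is solved by the standard quadratic dynamic program; a minimal sketch (this is the textbook baseline, not the paper's generalized algorithm):

```python
def lcs_length(a: str, b: str) -> int:
    """Classic O(len(a) * len(b)) dynamic program for the length of
    the Longest Common Subsequence of two strings."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of the prefixes a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

The generalization in the paper allows one input to contain sets of symbols that may be permuted, which this baseline recurrence does not handle.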

The least weight subsequence (LWS) problem is introduced, and is shown to be equivalent to the classic minimum path problem for directed graphs. A special case of the LWS problem is shown to be solvable in O(n log n) time generally and, for certain weight functions, in linear time. A number of applications are given, including an optimum paragraph formation problem and the problem of finding a minimum height B-tree, whose solutions realize improvement in asymptotic time complexity.
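The LWS recurrence itself is simple; the contribution is solving it faster than the obvious quadratic DP. A sketch of that quadratic baseline, with a hypothetical quadratic-slack weight function of the kind used in paragraph formation:

```python
def least_weight_subsequence(n, w):
    """Baseline O(n^2) dynamic program for the least weight subsequence
    (LWS) problem: f(0) = 0 and f(j) = min over 0 <= i < j of
    f(i) + w(i, j), where w(i, j) is the weight of a step from i to j.
    The paper's result solves this in O(n log n), or linear time for
    certain weight functions; this sketch is only the naive recurrence."""
    INF = float("inf")
    f = [0.0] + [INF] * n
    for j in range(1, n + 1):
        f[j] = min(f[i] + w(i, j) for i in range(j))
    return f[n]
```

For example, with w(i, j) = (j - i - L)^2 penalizing lines that deviate from an ideal length L, the minimizing break sequence corresponds to an optimum paragraph formation.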

Let I be the set of intervals with endpoints in the integers 1 ... n. Associated with each interval in I is a value from a commutative semigroup S. Two operations are to be implemented: updating the value associated with an interval, and querying the sum of the values associated with the intervals that include a given integer.

If the values are from a commutative group (i.e., inverses exist) then there is a data structure which enables both update and query algorithms of time complexity O(log n). For the semigroup problem, the use of range trees enables both update and query algorithms of time complexity O(log^2 n).

Data structures are presented with (update,query) algorithms of complexities (log^2 n, log n), (log n, log^2 n), (log n log log n, log n log log n).
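For the group case, one standard way to achieve O(log n) for both operations (a sketch using the familiar Fenwick-tree-over-a-difference-array construction; the paper's data structure may differ) is to turn each interval update into two point updates, so the stabbing query becomes a prefix sum:

```python
class IntervalStabbing:
    """O(log n) update and query when values come from a commutative
    group (here: integers under addition).  Adding v to interval [l, r]
    becomes +v at l and -v at r + 1 in a Fenwick (binary indexed) tree;
    the sum over intervals containing point q is the prefix sum at q.
    This relies on inverses, so it does not solve the semigroup case."""

    def __init__(self, n: int):
        self.n = n
        self.tree = [0] * (n + 2)  # 1-based; slot n + 1 absorbs r + 1

    def _add(self, i: int, v: int) -> None:
        while i <= self.n + 1:
            self.tree[i] += v
            i += i & -i

    def update(self, l: int, r: int, v: int) -> None:
        """Add v to the value of interval [l, r], 1 <= l <= r <= n."""
        self._add(l, v)
        self._add(r + 1, -v)

    def query(self, q: int) -> int:
        """Sum of values of all intervals containing point q."""
        s = 0
        while q > 0:
            s += self.tree[q]
            q -= q & -q
        return s
```

Without inverses the -v cancellation is unavailable, which is why the semigroup variants above pay an extra logarithmic (or log log) factor.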

Several methods (independent subsamples, leave-one-out, cross-validation, and bootstrapping) have been proposed for estimating the error rates of classifiers. The rationale behind the various estimators and the causes of the sometimes conflicting claims regarding their bias and precision are explored in this paper. The biases and variances of each of the estimators are examined empirically. Cross-validation, 10-fold or greater, seems to be the best approach; the other methods are biased, have poorer precision, or are inconsistent. Though unbiased for linear discriminant classifiers, the 632b bootstrap estimator is biased for nearest neighbors classifiers, more so for single nearest neighbor than for three nearest neighbors. The 632b estimator is also biased for CART-style decision trees. Weiss' LOO* estimator is unbiased and has better precision than cross-validation for discriminant and nearest neighbors classifiers, but its lack of bias and improved precision for those classifiers do not carry over to decision trees for nominal attributes.
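The k-fold cross-validation estimator the paper favors can be sketched as follows (`train` and `test_error` are hypothetical caller-supplied functions standing in for whatever classifier and loss are under study, not anything from the paper):

```python
import random

def k_fold_cross_validation(data, labels, train, test_error, k=10, seed=0):
    """Estimate a classifier's error rate by k-fold cross-validation:
    shuffle the sample, split it into k folds, train on k - 1 folds,
    measure error on the held-out fold, and average the k held-out
    error rates.  `train(X, y)` returns a fitted model;
    `test_error(model, X, y)` returns its error rate on (X, y)."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        tr = [i for i in idx if i not in held]
        model = train([data[i] for i in tr], [labels[i] for i in tr])
        errors.append(test_error(model,
                                 [data[i] for i in held_out],
                                 [labels[i] for i in held_out]))
    return sum(errors) / k
```

Each observation is held out exactly once, which is what gives the estimator its low bias relative to a single train/test split.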


A class of algorithms is given for maintaining self-organizing sequential search lists, where the only permutation applied is to move the accessed record of each search some distance towards the front of the list. During searches, these algorithms retain a back-pointer to a previously probed record in order to determine the destination of the accessed record's eventual move. The back-pointer does not traverse the list; rather, it is advanced occasionally to point to the record just probed by the search algorithm. This avoids the cost of a second traversal through a significant portion of the list, which can be a substantial savings when each record access may require a new page to be brought into primary memory. Probabilistic functions for deciding when to advance the pointer are presented and analyzed. These functions demonstrate average case complexities of measures such as asymptotic cost and convergence similar to some of the more common list update algorithms in the literature. In cases where the accessed record is moved forward a distance proportional to the distance to the front of the list, the use of these functions may save up to 50% of the time required for permuting the list.
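A minimal sketch of the back-pointer idea on a Python list (the fixed `advance_prob` here is a hypothetical stand-in for the paper's probabilistic decision functions, and the array splice stands in for a linked-list pointer operation):

```python
import random

def search_with_back_pointer(lst, key, advance_prob=0.5, rng=random):
    """Sequential search maintaining a single back-pointer: at each
    probe, with probability `advance_prob`, the back-pointer is
    advanced to the record just probed.  On a hit, the accessed record
    is moved to the back-pointer's position in one splice, so no second
    traversal of the list is needed.  Returns the permuted list."""
    back = 0  # position the accessed record will be moved to
    for pos, rec in enumerate(lst):
        if rec == key:
            lst.pop(pos)
            lst.insert(back, rec)
            return lst
        if rng.random() < advance_prob:
            back = pos  # advance back-pointer to record just probed
    return lst  # key not present; list unchanged
```

With `advance_prob = 0` this degenerates to move-to-front, and with `advance_prob = 1` to a transpose-like move; intermediate probabilities interpolate between the two, which is the design space the paper analyzes.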

Several techniques for estimating the range of uncertainty of estimated error rates and for estimating the significance of observed differences in error rates are explored in this paper. Textbook formulas which assume a large test set (i.e., a normal distribution) are commonly used to approximate the confidence limits of error rates or as an approximate significance test for comparing error rates. Expressions for determining more exact limits and significance levels for small samples are given here, and criteria are also given for determining when these more exact methods should be used. The assumed normal distribution gives a poor approximation to the confidence interval in most cases, but is usually useful for significance tests when the proper mean and variance expressions are used. A commonly used ±2σ significance test uses an improper expression for σ, which is too low and leads to a high likelihood of Type I errors. Common machine learning methods for estimating significance from observations on a single sample may be unreliable.
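The exact small-sample limits advocated here can be computed along the lines of the Clopper-Pearson construction, sketched below by bisection on the binomial tails (the paper's exact expressions may differ in detail); the textbook normal approximation is included for contrast:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05, tol=1e-9):
    """Exact two-sided confidence interval for an error rate estimated
    as k errors in n trials: the lower limit is the p with
    P(X >= k | p) = alpha/2, the upper the p with P(X <= k | p) = alpha/2,
    each found by bisection on the monotone binomial tail."""
    def solve(f, lo, hi):
        while hi - lo > tol:
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) else (lo, mid)
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(
        lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2, 0.0, 1.0)
    upper = 1.0 if k == n else solve(
        lambda p: binom_cdf(k, n, p) >= alpha / 2, 0.0, 1.0)
    return lower, upper

def normal_approx(k, n, z=1.96):
    """Textbook large-sample interval: p_hat +/- z * sqrt(p_hat*q/n)."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)
```

For small n or extreme error rates the two intervals diverge noticeably, which is the regime where the paper's criteria call for the exact method.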

Several methods (independent subsamples, cross-validation, and bootstrapping) have been proposed for estimating the error rates of classifiers. The power of the various estimators (i.e., their variances and confidence limits, their ability to test the null hypothesis) has received relatively little attention in the machine learning literature. The biases and variances of each of the estimators are examined empirically. Cross-validation, 10-fold or greater, is seen to be superior; the other methods are biased, have poorer variance, or are prohibitively time-consuming. Textbook formulas which assume a large test set (i.e., a normal distribution) are commonly used to approximate the confidence limits of error rates or as an approximate significance test for comparing error rates. Expressions for determining more exact limits and significance levels for small samples are given here, and criteria are also given for determining when these more exact methods should be used. The normal distribution is a poor approximation to the confidence interval in most cases, but is usually useful for significance tests when the proper mean and variance expressions are used. A commonly used ±2σ test uses an improper expression for σ, which is too low and leads to a high likelihood of Type I errors.
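For error rates measured on independent test sets, one textbook-correct comparison is the pooled-variance two-proportion z statistic, sketched below (this illustrates the "proper variance expression" point generically; it is not necessarily the exact expression the paper derives, and it does not apply to error rates measured on the same sample):

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """z statistic for comparing two error rates (k1 errors in n1
    trials vs. k2 errors in n2 trials) measured on independent test
    sets.  Under the null hypothesis of equal error rates, the
    variance is estimated from the pooled rate, avoiding the
    understated sigma of the naive +/-2-sigma test."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)        # pooled error rate under H0
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# |z| > 1.96 rejects equality of the two error rates at the 5% level.
```

Using each sample's own variance instead of the pooled estimate understates the null-hypothesis standard error, which is one way the Type I error inflation described above arises.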
