Several methods have been proposed for estimating the error rates of classifiers: independent subsamples, cross-validation, and bootstrapping. The power of these estimators (their variances, their confidence limits, and their ability to reject a false null hypothesis) has received relatively little attention in the machine learning literature. The biases and variances of each estimator are examined empirically. Cross-validation, 10-fold or greater, is seen to be superior; the other methods are biased, have higher variance, or are prohibitively time-consuming. Textbook formulas that assume a large test set (i.e., an approximately normal distribution of the error rate) are commonly used to approximate the confidence limits of error rates or as an approximate significance test for comparing error rates. Expressions for determining more exact limits and significance levels for small samples are given here, along with criteria for deciding when these more exact methods should be used. The normal distribution is a poor approximation to the confidence interval in most cases, but it is usually adequate for significance tests when the proper mean and variance expressions are used. A commonly used ±2σ test relies on an improper expression for σ that is too small and therefore leads to a high likelihood of Type I errors.
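As a concrete illustration of the gap between the textbook normal approximation and an exact small-sample interval, the sketch below compares the two for an error rate estimated from k errors on n test cases. The exact interval shown is the standard Clopper-Pearson binomial interval; this choice is an assumption made here for illustration and is not necessarily the expression derived in the paper.

```python
# Sketch: normal-approximation vs. exact binomial confidence limits for an
# error rate estimated from k errors in n test cases. Clopper-Pearson is used
# purely as one standard "exact" small-sample interval, not as the paper's
# specific expression.
from scipy.stats import beta, norm


def normal_approx_interval(k, n, alpha=0.05):
    """Textbook large-sample interval: p_hat +/- z * sqrt(p_hat*(1-p_hat)/n)."""
    p_hat = k / n
    z = norm.ppf(1 - alpha / 2)
    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def clopper_pearson_interval(k, n, alpha=0.05):
    """Exact binomial (Clopper-Pearson) interval via beta-distribution quantiles."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper


if __name__ == "__main__":
    # Small test set: 3 errors on 20 cases; the two intervals differ noticeably.
    k, n = 3, 20
    print("normal approx. :", normal_approx_interval(k, n))
    print("exact binomial :", clopper_pearson_interval(k, n))
```

On a test set this small, the exact interval is wider and asymmetric about the observed error rate, which illustrates the sense in which the normal approximation is described above as poor for confidence intervals.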