An exact probability metric for decision tree splitting and stopping
ID3's information gain heuristic is well-known to be biased towards multi-valued attributes. This bias is only partially compensated by the gain ratio used in C4.5. Several other alternatives have been proposed and are examined here (distance, orthogonality, a Beta function, and two chi-squared tests). Gain ratio and orthogonality are strongly correlated, and all of these metrics are biased towards splits with one or more small expected values, under circumstances where the split likely ocurred by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the exact posterior probability of the null hypothesis. Both gain and the chi-squared tests are shown to arise in asymptotic approximations to the hypergeometric, revealing similar criteria for admissibility and showing the nature of their biases. Previous failures to find admissible stopping rules in CART and IDS are traced to coupling these biased approximations with one another or with arbitrary thresholds; problems which are overcome by the hypergeometric. Empirical results show that hypergeometric pre-pruning should be done, as trees pruned in this way are more practical, simpler, more efficient, and generally no less accurate than unpruned or post-pruned trees.