Decision trees (DTs) are among the oldest machine learning models and have received considerable attention from both practitioners and the research community. Although their roots go back to the 1950s, they became popular in the 1980s and 1990s with the development of methods such as CART and C4.5. They are conceptually simple yet powerful. State-of-the-art frameworks, such as XGBoost or LightGBM, rely on them as base learners, but they have also been used as standalone predictors.
Despite the rich history of decision trees and the existence of numerous methods, their applicability beyond traditional supervised learning has been explored only to a limited extent. For instance, trees have barely been used in several fast-growing ML subfields, such as semi-supervised and self-supervised learning and nonlinear dimensionality reduction (e.g., nonlinear embeddings). What most of these tasks have in common is that the objective function takes a particular form involving manifold regularization, which exploits the geometry of the underlying data distribution. For example, a common assumption is that similar instances should have similar predictions. However, incorporating this type of regularization into decision trees is non-trivial. The main reason is that learning decision trees from data (even in the supervised setting) is still an open research problem, because it requires solving a hard combinatorial optimization problem. In fact, finding an optimal decision tree, or even a constant-factor approximation of one, is NP-hard in most formulations of the problem. Adding a manifold regularization term to the overall objective makes the problem even harder to solve.
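To make the setting concrete, one standard instance of such an objective is Laplacian manifold regularization (a generic sketch of the family of objectives meant here, not necessarily the exact form studied later in the dissertation):
\[
\min_{f} \; \sum_{i\in\mathcal{L}} \ell\bigl(y_i, f(\mathbf{x}_i)\bigr)
\;+\; \frac{\lambda}{2} \sum_{i,j} w_{ij}\,\bigl(f(\mathbf{x}_i)-f(\mathbf{x}_j)\bigr)^2
\;=\; \sum_{i\in\mathcal{L}} \ell\bigl(y_i, f(\mathbf{x}_i)\bigr) \;+\; \lambda\,\mathbf{f}^{T}\mathbf{L}\,\mathbf{f},
\]
where $\mathcal{L}$ indexes the labeled instances, $w_{ij}\ge 0$ measures the similarity of instances $i$ and $j$, $\mathbf{f}=(f(\mathbf{x}_1),\dots,f(\mathbf{x}_n))^{T}$ collects the predictions over all instances, and $\mathbf{L}$ is the graph Laplacian of the weight matrix $(w_{ij})$. The second term penalizes predictions that differ across similar instances; when $f$ is a decision tree, it couples the tree's outputs at all instances, which is precisely what standard greedy tree induction cannot accommodate.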
In this dissertation, we study decision trees and, more generally, tree-based models in the above-mentioned settings, i.e., those involving manifold regularization. We argue that these types of problems are of great practical importance, but solving them directly is intractable for decision trees. Using semi-supervised learning and nonlinear dimensionality reduction as examples, we derive a generic algorithm for such optimization problems. It is based on a reformulation of the problem that requires iteratively solving two simpler subproblems. The first always involves supervised learning of a tree, which we solve with the Tree Alternating Optimization (TAO) algorithm. The second depends on the particular task and may amount to, for example, solving a linear system or optimizing a nonlinear embedding. We also show that the algorithm is scalable and highly effective on a number of practical tasks.
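As a concrete illustration of the alternating scheme in the semi-supervised case, the following minimal Python sketch uses a k-nearest-neighbor graph Laplacian; the tree step uses scikit-learn's DecisionTreeRegressor as a stand-in for TAO (which is not available in standard libraries), and all function names and parameter values here are illustrative assumptions, not the dissertation's actual implementation.

import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph
from sklearn.tree import DecisionTreeRegressor

def laplacian_ssl_tree(X, y, labeled, lam=1.0, mu=1.0, n_iter=10):
    """Alternate between a graph-smoothing step (a sparse linear system)
    and a supervised tree-fitting step on the smoothed pseudo-labels Z.

    Sketched objective:  sum_{i labeled} (z_i - y_i)^2
                         + lam * Z^T L Z + mu * ||Z - tree(X)||^2.
    `labeled` is a boolean mask; y values at unlabeled points are ignored.
    """
    n = X.shape[0]
    # Similarity graph over all (labeled and unlabeled) points.
    W = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
    W = 0.5 * (W + W.T)                        # symmetrize
    L = laplacian(W)                           # sparse graph Laplacian
    J = diags(labeled.astype(float))           # indicator of labeled points
    A = (J + lam * L + mu * identity(n)).toarray()  # fixed system matrix
    t = np.zeros(n)                            # current tree predictions
    tree = DecisionTreeRegressor(max_depth=6)  # stand-in for a TAO tree
    for _ in range(n_iter):
        # Z-step: with the tree fixed, the optimum of the objective over Z
        # solves (J + lam*L + mu*I) Z = J*y + mu*tree(X).
        b = np.where(labeled, y, 0.0) + mu * t
        Z = np.linalg.solve(A, b)              # use a sparse solver at scale
        # Tree step: ordinary supervised regression on the pseudo-labels.
        tree.fit(X, Z)
        t = tree.predict(X)
    return tree

The point of the reformulation is visible in the loop: given the tree, the Z-step is a convex problem (here, a linear system), and given Z, the tree step is plain supervised learning, so any tree learner, in particular TAO, can be plugged in unchanged.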