Internet Usage Mining Using Random Forests
- Author(s): Liu, Xuening
- Advisor(s): Schoenberg, Frederic P
- et al.
Internet service providers are increasingly realizing that attracting and retaining users takes more than a hunch. As data sets grow ever larger, data mining has moved into the spotlight. This thesis analyzes Internet usage with random forests, an efficient machine learning algorithm widely used in both academia and industry. The data set records the behavior of new users on a website, T. Every day, thousands of people sign up, but not all of them continue using T. The aim is to find out what makes some new users decide to leave. Variable importance in random forests is well suited to this question. However, it tends to be biased when the input variables are of multiple types or are correlated. Correcting the bias requires alternative implementations of random forests, such as those built on conditional inference trees and conditional permutation. The analysis yields some interesting findings: variables related to actual usage are significant, and the number of clicks from search is also important.
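The idea behind permutation-based variable importance can be illustrated with a minimal sketch: shuffle one input column at a time and measure how much predictive accuracy drops. Everything in it is an assumption made for illustration and is not taken from the thesis: the synthetic data, the column meanings, the function names, and the `stand_in_model` rule, which takes the place of a fitted random forest. It shows plain (marginal) permutation only, the variant that can be biased for correlated inputs, not the conditional permutation scheme.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column, breaking its link to the labels.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, shuffled)
            ]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical synthetic data, not the thesis data set:
# column 0 = a usage count, column 1 = clicks from search, column 2 = noise.
data_rng = random.Random(42)
rows = [
    [data_rng.randint(0, 20), data_rng.randint(0, 10), data_rng.random()]
    for _ in range(500)
]
labels = [1 if row[0] + 2 * row[1] < 15 else 0 for row in rows]  # 1 = leaves


def stand_in_model(row):
    """Hypothetical rule standing in for a fitted random forest."""
    return 1 if row[0] + 2 * row[1] < 15 else 0


importances = permutation_importance(stand_in_model, rows, labels)
for name, score in zip(["usage", "search_clicks", "noise"], importances):
    print(f"{name}: {score:.3f}")
```

Because the stand-in rule ignores the noise column, shuffling that column leaves accuracy unchanged, while shuffling either informative column lowers it. With a real forest and correlated inputs the marginal scores are harder to interpret, which is the motivation for conditional permutation.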