Bayesian Hierarchical Model of the Browsing Behavior of World Wide Web Users
We consider the case of surfing within a single large Web site, which is important from the point of view of site design, Web server proxy efficiency and the optimal ranking of pages by search engines. We propose a method for clustering user sessions and illustrate it with the site msnbc.com. We use a random sample from publicly available server log data on the Web pages chosen by 989818 users in a twenty-four hour period, where the response measure for each user is an ordered sequence of choices among 17 categories (UCI KDD Archive). A common way to model the browsing behavior of users is to assume that each user's decision process is a random walk whose first passage time to a threshold follows a two-parameter inverse Gaussian distribution. Another hypothesis examined in the literature is that users conduct an independent Bernoulli trial at each page to decide whether to stop, which implies a geometric distribution for the number of pages visited. Mixtures of first-order Markov processes and model-based clustering, with and without a Bayesian flavor, have offered very useful exploratory data analysis. All these studies have shown evidence that Web-surfing behavior may be non-Markov in nature and have illustrated how hard it is to capture dependencies in the data. The evidence on how these models perform over a wide range of Web site formats is still inconclusive. Performance has been measured by the ability to predict page hits, by the resulting distribution of page hits, and by the contribution to efficient Web caching schemes. Some models have been tested with server log data from AOL or similar sites, and others have been tested within a single Web site such as msnbc.com. The levels of aggregation of pages and clustering of user behavior have also varied across studies.
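For reference, the two session-length hypotheses mentioned above correspond to the following standard distributions. The symbols are our own notation and do not come from the cited studies: L is the number of pages visited, p is the per-page stopping probability, and mu and lambda are the mean and shape parameters of the inverse Gaussian.

```latex
% Independent Bernoulli stopping trial at each page => geometric session length
P(L = n) = (1 - p)^{\,n-1}\, p, \qquad n = 1, 2, \dots, \qquad \mathbb{E}[L] = \frac{1}{p}

% First passage time of a random walk to a threshold => two-parameter inverse Gaussian
f(x;\, \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi x^{3}}}\,
  \exp\!\left( -\frac{\lambda (x - \mu)^{2}}{2 \mu^{2} x} \right), \qquad x > 0
```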
In this paper, we assume that for browsing within a news portal like msnbc.com, where contents are continually changing, the server log data is only meaningful when pages are aggregated into categories, as they are in the msnbc.com data set, and that the order of browsing may not be relevant. We use a Bayesian hierarchical model of the page counts per user to obtain posterior distributions of page access frequency, which allow us to cluster user sessions into a relatively small number of groups. The model has enough parameters to fit the data well, while using a population distribution to structure the dependence among the parameters. The model can be generalized to different types of Web sites, different levels of aggregation of pages and different clustering schemes.
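As a rough illustration of how a population distribution can structure per-user parameters, the sketch below uses a conjugate Dirichlet-multinomial model of page counts over the 17 categories. This is an assumption made for illustration only: the abstract does not specify the model's form, and the function name `posterior_mean`, the fixed symmetric prior `alpha`, and the toy sessions are all hypothetical. In a full hierarchical model the prior parameters would themselves be given a hyperprior and estimated from the data.

```python
# Hypothetical sketch: Dirichlet-multinomial model of per-user page counts.
# The model form, prior values, and toy data are assumptions, not the paper's.

def posterior_mean(counts, alpha):
    """Posterior mean of category probabilities for one user.

    With a Dirichlet(alpha) prior and multinomial counts n, the posterior is
    Dirichlet(alpha + n), whose mean is (alpha_k + n_k) / (sum(alpha) + sum(n)).
    """
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]

K = 17                 # number of page categories in the msnbc.com data
alpha = [1.0] * K      # fixed symmetric prior standing in for the population level

# Toy sessions: page-view counts per category (order of visits is ignored).
light_user = [0] * K
light_user[0] = 1      # a single page view in category 0

heavy_user = [0] * K
heavy_user[0] = 40     # many views in category 0
heavy_user[1] = 10     # some views in category 1

light_post = posterior_mean(light_user, alpha)
heavy_post = posterior_mean(heavy_user, alpha)

print(round(light_post[0], 3), round(heavy_post[0], 3))
```

With only one page view, the light user's estimate stays close to the prior, while the heavy user's estimate is driven mostly by the observed counts. Posterior summaries of this kind could then be passed to any clustering scheme to group sessions.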