Graphlet Analysis Of Networks
- Author(s): Maharaj, Sridevi Kamla
- Advisor(s): Hayes, Wayne
- et al.
Over the past decade, the study of graphlets has emerged as a useful tool in the study of networks. Graphlets are small, induced, connected subgraphs on a larger graph. They aid in quantifying the structure of networks, identifying functionality of sub-regions within a network, and understanding how relationships and interactions represented by the network may evolve over time. In this study, we approach a variety of problems from the vantage point of graphlets. First, we model the structure of BioGRID protein-protein interaction (PPI) networks using different network models and assess their fitness primarily via their graphlet topology. We also briefly explore the relationships between various network comparison measures and find there is little agreement between them. Despite this disagreement, we find that the models most suited to modeling PPI networks have changed as the data evolve and that it is the STICKY and scale-free based models which best fit current PPI networks. Secondly, as the PPIs (and other networks) are constantly updated, they become larger and denser. As network data is growing in volume (some networks have on the order of hundreds of thousands of nodes and millions of edges), exhaustive enumeration of graphlets becomes infeasible and hence, in order to perform graphlet analysis on networks, we must sample graphlets from networks. We explore four different graphlet sampling techniques. We highlight the advantages and disadvantages of different sampling approaches and show that a Markov Chain Monte Carlo approach to graphlet sampling is best for approximating the proportion of each graphlet within a network. In addition, we find that though they have different biases, all four techniques are able sample a sufficiently diverse set of graphlets that graphlet-based network comparison measures are still able to distinguish between different network types. From sampling, we are also able to distinguish graphs of much higher density, which would ordinarily take weeks or months of CPU time to fully enumerate. Thirdly, we demonstrate a need for the development of synthetic network generators by showing that that many state-of-the-art synthetic network generators do not consistently reproduce networks matching traditional properties of real-world networks and graphlet distributions of the real-world networks. Finally, we analyze the brain fMRI connectivity in children with OCD and a control (CON) group. We find that the network connectivity is completely different in both networks and we highlight some of their differences, including hub nodes, different correlations between connectivity paths and physical distances in the brain, the occurrence of cliques and circuits and graphlet distributions. We also model the OCD and CON connectivity networks using theoretical models and find that the CON network is fit best by a Geometric with Gene Duplication and Mutation network while the OCD network has no best fit.