eScholarship
Open Access Publications from the University of California

UC Riverside Electronic Theses and Dissertations

Exploring Frontiers in Graph Learning

No data is associated with this publication.
Creative Commons 'BY-NC-SA' version 4.0 license
Abstract

Graphs have been used to model many different types of data, ranging from social networks to the human brain. We can then formulate real-world problems as tasks (like link prediction or node classification) on these graphs. For example, item recommendation can be viewed as link prediction on a user-item purchase graph and bot detection as node classification on a social interaction graph. Traditionally, these graph tasks have been solved through statistical or spectral analysis. However, relatively recent works have proposed the idea of Graph Neural Networks (GNNs), which allow us to solve these problems in a highly scalable and effective manner.

In our recent work, we focus on exploring three frontiers in graph learning. First, we bridge ideas from different domains and apply them to graph learning to define new models and methods. Second, we work on solving tasks in a fast and space-efficient manner, improving their practical utility. Finally, we revisit and challenge established concepts in graph learning to test whether they actually hold. This dissertation summarizes four pieces of work, each of which fits into two or more of the aforementioned frontiers.

The first work in our dissertation, TenGAN, examines the task of multiplex graph generation: generating graphs that have multiple views, each with different edges but the same nodes. This task is typically solved with statistical preferential attachment models, which are inflexible and require explicit modeling of the desired graph attributes. Modeling multiplex graphs with a naive generative model is also difficult due to the large number of parameters required. We propose borrowing ideas from tensor decompositions to implicitly compress the parameters of the network, greatly reducing the number of parameters required.
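
To make the compression idea concrete, the following is a minimal sketch (Python/NumPy, with hypothetical sizes) of how a CP-style tensor factorization can stand in for a dense node-by-node-by-view parameter tensor. It illustrates the general principle of factorized parameter sharing rather than the TenGAN architecture itself.

import numpy as np

# Hypothetical sizes: a multiplex graph with n nodes and v views.
n, v, rank = 1000, 4, 16

# A dense parameter tensor would need n * n * v entries (~4M here).
dense_params = n * n * v

# CP-style factorization: three small factor matrices whose outer products
# reconstruct the tensor, needing only (n + n + v) * rank entries.
A = np.random.randn(n, rank)   # node factors (rows)
B = np.random.randn(n, rank)   # node factors (columns)
C = np.random.randn(v, rank)   # view factors
compressed_params = (n + n + v) * rank

def view_logits(k):
    # Builds sum_r A[i, r] * B[j, r] * C[k, r] for all node pairs (i, j),
    # i.e., the edge-probability logits of view k.
    return np.einsum('ir,jr,r->ij', A, B, C[k])

print(f"dense: {dense_params:,} params, factored: {compressed_params:,} params")
print("view 0 logits shape:", view_logits(0).shape)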

Our second work shifts its focus towards efficient link prediction with GNNs. Currently, the most scalable approach is node-based link prediction, where we compute node embeddings for candidate pairs and use them to estimate the probability that a link exists between them. Most existing state-of-the-art methods rely on contrastive learning, which uses both positive samples (e.g., neighbors) and negative samples (e.g., non-neighbors). However, these negative samples are often expensive to obtain. A recent family of methods, dubbed non-contrastive learning, avoids the expensive negative sampling step but has only been evaluated on node classification tasks. In this work, we perform a detailed benchmark of existing non-contrastive methods on link prediction, discover a key limitation, and mitigate it by proposing a simple modification. This results in an improvement of up to 120% in Hits@50 over existing non-contrastive methods and a 14× speedup over contrastive methods.
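
As a rough sketch of node-based link prediction (the random embeddings and dot-product decoder below are placeholders, not the specific encoders benchmarked in this work): node embeddings are computed once by an encoder, and candidate pairs are then scored with a cheap pairwise function, so no per-pair subgraph processing is needed.

import torch

# Hypothetical setup: embeddings would come from any trained GNN encoder;
# here they are random just to make the sketch runnable.
num_nodes, dim = 10_000, 128
node_emb = torch.randn(num_nodes, dim)

def link_probability(pairs):
    """pairs: LongTensor of shape (batch, 2) holding (source, target) node ids."""
    src, dst = node_emb[pairs[:, 0]], node_emb[pairs[:, 1]]
    # Dot-product decoder: one embedding lookup per node, so candidate pairs
    # can be scored without re-running the encoder for every pair.
    return torch.sigmoid((src * dst).sum(dim=-1))

candidates = torch.randint(0, num_nodes, (5, 2))
print(link_probability(candidates))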

Our third work, CARL-G, also focuses on speeding up self-supervised node representation learning. We do this by observing similarities between contrastive learning and clustering, and we propose a new framework and loss function that reformulates node representation learning as a clustering problem. Our key contribution is finding that an existing class of unsupervised clustering evaluation metrics, known as Clustering Validation Indices (CVIs), can serve as effective substitutes for existing contrastive losses. With the choice of specific CVIs, we can reduce the cost of computing the loss function to linear time (in contrast to the quadratic time complexity of most contrastive losses). By also taking advantage of decades of research on speeding up clustering, we can compute better node representations in much less time. CARL-G trains 79× faster than the best-performing node classification baseline and 1,500× faster than the best-performing node clustering and similarity search baseline.
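
The sketch below illustrates why a centroid-based CVI can be cheap to optimize: each node is compared only against the k cluster centroids, giving O(nk) cost instead of the O(n²) pairwise comparisons typical of contrastive losses. This is a simplified-silhouette-style loss written for illustration only, not the exact CARL-G objective.

import torch

def simplified_silhouette_loss(embeddings, assignments, num_clusters):
    """Illustrative centroid-based CVI loss (simplified-silhouette style).

    Each node is compared only to the cluster centroids, so the cost is
    O(num_nodes * num_clusters) rather than the O(num_nodes^2) of typical
    contrastive losses. This is a sketch, not the exact CARL-G objective.
    """
    # Centroid of each cluster (out-of-place index_add keeps autograd intact).
    centroids = torch.zeros(num_clusters, embeddings.size(1)).index_add(0, assignments, embeddings)
    counts = torch.bincount(assignments, minlength=num_clusters).clamp(min=1)
    centroids = centroids / counts.unsqueeze(1)

    # Distances from every node to every centroid: shape (n, k).
    dists = torch.cdist(embeddings, centroids)
    a = dists.gather(1, assignments.unsqueeze(1)).squeeze(1)  # distance to own centroid
    b = dists.scatter(1, assignments.unsqueeze(1), float('inf')).min(dim=1).values  # nearest other centroid

    silhouette = (b - a) / torch.maximum(a, b).clamp(min=1e-8)
    return -silhouette.mean()  # maximize the silhouette-style score

emb = torch.randn(2048, 64, requires_grad=True)
labels = torch.randint(0, 10, (2048,))
loss = simplified_silhouette_loss(emb, labels, 10)
loss.backward()
print(float(loss))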

Our fourth and final work focuses on improving the inductive capabilities of recommendation systems in a fast and space-efficient manner. Recommendation models typically store an embedding for each unique user and item in embedding tables. However, when performing inference with real-world recommendation systems, we often encounter users and items that were not present during training. The most common solutions are to use either a randomly selected embedding from an out-of-vocabulary (OOV) embedding table or a fixed vector such as the mean embedding. Our work challenges the assumption that these simple methods are sufficient, and we explore 9 different OOV embedding methods on 5 different models to find more effective approaches. We find that locality-sensitive-hashing (LSH)-based methods generally outperform competing methods, yielding a 3.74% mean improvement over the industry-standard baseline.
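
To illustrate the LSH idea (a hypothetical sketch with made-up sizes, not the exact configurations studied in this work): an unseen user or item is hashed by its feature vector against fixed random hyperplanes, so similar OOV entities map to the same learned bucket embedding instead of a single shared OOV vector.

import torch

# Illustrative LSH-based OOV embedding lookup. Sizes and features are
# hypothetical; the point is that similar unseen users/items share a bucket.
feat_dim, num_bits = 32, 8                              # 2**num_bits = 256 buckets
hyperplanes = torch.randn(feat_dim, num_bits)           # fixed random projections
bucket_emb = torch.nn.Embedding(2 ** num_bits, 64)      # learned bucket embeddings

def oov_embedding(features):
    """features: (batch, feat_dim) side features of unseen users or items."""
    bits = (features @ hyperplanes > 0).long()           # sign of each projection
    powers = 2 ** torch.arange(num_bits)
    bucket_ids = (bits * powers).sum(dim=1)              # binary code -> bucket id
    return bucket_emb(bucket_ids)

print(oov_embedding(torch.randn(4, feat_dim)).shape)     # torch.Size([4, 64])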

Main Content

This item is under embargo until October 22, 2025.