The subgraph matching problem arises in a variety of domains including pattern recognition for segmented images, meshes of 3D objects, biochemical reactions, and security applications.Large and complex solution spaces are common for this graph-based problem, especially when the world graph contains many more nodes and edges in comparison to the template graph.
Overall, this dissertation covers three parts: using symmetry to boost the subgraph matching algorithm and compress the solution space; presenting an active learning problem for subgraph matching problem; and discussing the different quantitative strategies for active learning.
As for the symmetry, we introduce rigorous definitions of structural equivalence and establish conditions for when it can be safely used to generate more solutions.We then adapt a state-of-the-art solver and perform a comprehensive series of tests to demonstrate how the Venn-diagram could be applied to visualize the symmetry and compress the solution space.
Next, a real use-case scenario may require analysts to query additional information to reduce the complexity of the problem and there is currently little guidance to analysts on how to optimize this choice. By analogy to the well-known active learning problem in machine learning classification problems, we present an active learning problem for the subgraph matching problem in which an algorithm suggests optimal template target nodes that would be most likely to reduce the overly large solution space. Additional information about those target nodes are then acquired from humans in the loop, in an iterative process. We present some case studies illustrating different strategies on a variety of datasets.
We further develop this idea by introducing rigorous mathematical definitions of the active learning problem. We also prove NP completeness of this problem.We present numerical experiments for single channel and multichannel subgraph matching problems created from both synthetic and real world datasets.
We compare different quantitative criteria for choosing nodes to query. We introduce a new method based on a spanning tree that outperforms other graph-based criteria for the multichannel datasets.