With the rise of the Internet, social computing and numerous mobile applications have brought about a large amount of unstructured entity data, including places, events, and things. On one hand, the entity data, populated by a variety of data sources, is growing at an ever increasing speed. On the other hand, the service suppliers expect to render the entities pertinent to the things present in the information need of the users, rather than the data that merely matches the request strings. To meet these challenges, knowledge graph has been widely adopted in practice and become a fundamental building block for many commercial products from better recommendation systems to enhanced search engines.
As the knowledge graph subsumes humongous valuable content, there is an emerging need for the querying techniques that can extract and deliver the information from the data to the users. However, querying the real-life knowledge graph is not a trivial task. It has to deal with complex, unstructured graph data that can't fit a specific data model. Although there are many research efforts that aim in this direction, they typically have not addressed the real challenges: (1) In contrast to the complex and tedious graph data, the query from the end users tends to be ambiguous and underspecified. (2) It requires the semantic query understanding since the rigid string matching is often not desirable. (3) The querying system must scale to large heterogeneous graph data. This suggests the techniques that combine effective ranking and searching strategies.
In this thesis, we propose to tackle the challenges by primarily introducing a novel framework, Schemaless Graph Querying (SLQ), which supplies a flexible mechanism to querying large knowledge graphs. In essence, the query engine is built upon a set of transformation functions that automatically map keywords and linkages from a query to their matches in a graph. The end users are therefore not required to describe their queries precisely as that by most querying systems. To return the most relevant results in a fast way, SLQ also incorporates a learned ranking model that shall not rely on manually labeled data especially when the training data is difficult to obtain. This in turn give rise to the efficient top-K search techniques.
SLQ is designed inherently to better understand the user query intent. Rather than merely performing syntax based search, SLQ introduces the ontology base matching technique that can as well produce semantically related results. By leveraging an ontology index, an effective filtering mechanism is proposed to fast extract the top-K results based on the similarity quality. Moreover, a result summarization technique is invented to help the user inspect the excessive number of results: The users can enlarge and get detailed information on a summarized result view based on their search intents.
Another key problem in the context of graph querying is that this process needs to run under severe time constraints. The efficient search techniques are introduced from both the algorithm and the system aspects. In terms of the algorithm, we propose STAR, a framework for fast top-K searching for star queries. The technique is further extended to tackle general graph queries. From the system aspect, we investigate the distributed processing of large graphs in a cluster and propose the graph partitioning approaches. The technique is able to adapt in real time to changes in the query workload.
In summary, my research is introduced based on the hypotheses that as the knowledge graph becomes more available, there is a great chance of serving user queries in a better way. My contributions devote to realize this vision mainly for the applications of graph querying.