Code Search using Code2Seq
The rapid development of software has produced many large, complex and swiftly growing codebases consisting of thousands of lines of source code. As a result, searching for code that performs a particular function has become an inherent part of software development. Developers often turn to general-purpose search engines such as Google, or Q&A sites such as Quora and StackOverflow, to find relevant examples, but these tools are not designed specifically for code search. In addition, code that is proprietary to a particular company or organization is not available on these public platforms. To address these challenges, various semantic code search approaches based on information retrieval and deep learning techniques have been proposed, which allow a user to search a code repository using natural language queries. However, information retrieval based code search systems rely on keywords and may not return relevant results if the query keyword is not present in the search documents. Deep learning approaches can retrieve code snippets that are similar to the user query even if the exact keywords are absent, but they treat source code as natural language and do not take into account its intrinsic semantic and syntactic information. In this thesis, I aim to develop a hybrid semantic code search system that combines two components. The first is a neural model that leverages the syntactic properties of software artifacts to generate comments that automatically summarize the function of each code snippet. The second is an information retrieval component that returns methods similar to the user query by computing the cosine similarity between the query input vector and the vectors of the machine-generated comments in the search corpus.
Specifically, I use Code2Seq, which represents a code snippet as an aggregation of individual paths in its Abstract Syntax Tree and uses attention to learn the relevance of each path when generating the target comment sequence. The code snippets, together with their automatically generated comments, form the IR search space from which code snippets relevant to the user query are retrieved. The dataset, preprocessing steps, system design and implementation details are discussed in depth. Finally, the proposed system is evaluated through a precision study, which shows that the top result returned is relevant to the user query 40% of the time.
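The retrieval step described above can be illustrated with a minimal sketch. It assumes a simple bag-of-words term-count vectorization, which may differ from the one used in the thesis, and the names `vectorize`, `cosine_similarity`, `search` and the two-entry `corpus` are hypothetical. The comments in the corpus stand in for the output of the comment-generation model, which is not reproduced here.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term-count vector (assumed scheme)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def search(query, corpus, top_k=1):
    """Rank (generated_comment, code_snippet) pairs against the query."""
    q = vectorize(query)
    scored = [(cosine_similarity(q, vectorize(comment)), code)
              for comment, code in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical search space: placeholder comments paired with code snippets.
corpus = [
    ("read file contents into string", "String readFile(String path) { ... }"),
    ("sort list of integers", "void sortInts(List<Integer> xs) { ... }"),
]

best_score, best_code = search("sort integers", corpus)[0]
```

Because the query is matched against the generated comments and not against the raw code, the quality of retrieval in such a design depends directly on the quality of the generated summaries.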