Improving Mining Performance for Internet Code Search Engines
- Author(s): Kwak, Thomas Minsoo
- Advisor(s): van der Hoek, André, et al.
To aid programmers in searching for code on the Internet, researchers and developers have created code search engines. These search engines use ranking algorithms to sort their results based on properties of the code. However, obtaining the properties that a ranking algorithm needs is a resource-intensive process that includes downloading, parsing, analyzing, and indexing large amounts of code. Additionally, few guidelines exist for reducing the resources this process requires without degrading the performance of the ranking algorithm. We explore two techniques for improving the process of mining code from the Internet for code search engines. First, we introduce What-If Ranking Analysis, a novel technique that attempts to find cheaper versions of a ranking algorithm by reducing the number of properties required for ranking. Second, we modify the original Beowulf cluster to optimize for network throughput instead of CPU performance, and examine whether the modified cluster can outperform a single high-performance server in mining and uploading code for use in a code search engine. Our findings show that more efficient ranking algorithms exist that perform as well as the original ranking algorithm while reducing the time spent mining by 44% and the disk space used for storage by 41%. Further, we find that the modified Beowulf cluster mines and processes code three times faster than a high-performance server at approximately the same cost.
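The general idea behind searching for a cheaper ranking algorithm can be illustrated with a small sketch. This is not the procedure used in the thesis, which the abstract does not detail. It is a minimal illustration that assumes a ranking is a weighted sum of code properties, that each property has a known mining cost, and that two rankings "perform as well" when their top-k results overlap. The property names, weights, costs, sample data, and the `what_if` function are all hypothetical.

```python
from itertools import combinations

# Hypothetical mining cost per property (arbitrary units) and ranking weights.
COSTS = {"name_match": 1.0, "doc_comments": 2.0, "call_graph": 8.0}
WEIGHTS = {"name_match": 0.6, "doc_comments": 0.3, "call_graph": 0.1}

# Hypothetical property values for four code results.
ITEMS = {
    "a": {"name_match": 0.9, "doc_comments": 0.8, "call_graph": 0.5},
    "b": {"name_match": 0.7, "doc_comments": 0.9, "call_graph": 0.4},
    "c": {"name_match": 0.2, "doc_comments": 0.1, "call_graph": 0.9},
    "d": {"name_match": 0.5, "doc_comments": 0.3, "call_graph": 0.2},
}


def rank(items, props):
    """Order item ids by the weighted sum of the selected properties, best first."""
    def score(item_id):
        return sum(WEIGHTS[p] * items[item_id][p] for p in props)
    # Ties are broken by item id so the ordering is deterministic.
    return sorted(items, key=lambda i: (-score(i), i))


def top_k_overlap(a, b, k):
    """Fraction of the top-k results that two rankings share."""
    return len(set(a[:k]) & set(b[:k])) / k


def what_if(items, k=2, min_overlap=1.0):
    """Return (subset, cost, overlap) for the cheapest property subset whose
    top-k results agree closely enough with the full ranking."""
    baseline = rank(items, list(WEIGHTS))
    best = None
    for r in range(1, len(WEIGHTS) + 1):
        for subset in combinations(WEIGHTS, r):
            overlap = top_k_overlap(baseline, rank(items, subset), k)
            cost = sum(COSTS[p] for p in subset)
            if overlap >= min_overlap and (best is None or cost < best[1]):
                best = (subset, cost, overlap)
    return best


if __name__ == "__main__":
    print(what_if(ITEMS, k=2))
```

The exhaustive search over subsets is only practical for a handful of properties. A real analysis would also need a measure of ranking quality chosen to suit the search engine, rather than the simple top-k overlap assumed here.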