Clone detection locates exact or similar pieces of code, known as clones, within or between software systems. With the amount of source code increasing steadily, large-scale clone detection has become a necessity. Large code bases and project repositories have led to several new use cases for clone detection, including mining library candidates, detecting similar mobile applications, detecting license violations, reverse engineering product lines, finding the provenance of a component, and code search.
While several techniques have been proposed for clone detection over many years, the accuracy and scalability of clone detection tools and techniques remain an active area of research. Specifically, there is a marked lack of clone detectors that scale to large systems or repositories, particularly for detecting near-miss clones, where significant editing may have taken place in the cloned code.
The problem stated above motivates the need for clone detection techniques and tools that satisfy the following requirements: (1) accurate detection of near-miss clones, where minor to significant editing changes occur in the copy/pasted fragments; (2) scalability to hundreds of millions of lines of code and several thousand projects; and (3) minimal dependency on programming languages.
To that end, this dissertation presents SourcererCC, an accurate near-miss clone detection tool that scales to hundreds of millions of lines of code (MLOC) on a single standard machine. The core idea of SourcererCC is to build an optimized index of code blocks and compare them using a simple bag-of-tokens strategy, which is very effective in detecting near-miss clones. Coupled with several filtering heuristics that reduce the size of the index, this approach is also very efficient, as it reduces the number of code block comparisons needed to detect clones.
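The bag-of-tokens idea can be illustrated with a minimal sketch. It is not SourcererCC's implementation: the regex tokenizer, the overlap measure (shared tokens divided by the size of the larger block), the 0.8 threshold, and the function names are all assumptions made for illustration, and the index here is a plain inverted index without the filtering heuristics mentioned above.

```python
import re
from collections import Counter, defaultdict

# Assumed tokenizer: identifiers, integer literals, and single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")


def bag_of_tokens(code):
    """Return the multiset of tokens in a code block (token order is ignored)."""
    return Counter(TOKEN_RE.findall(code))


def overlap_similarity(a, b):
    """Shared tokens divided by the size of the larger bag (assumed measure)."""
    shared = sum((a & b).values())  # multiset intersection
    largest = max(sum(a.values()), sum(b.values()))
    return shared / largest if largest else 0.0


def detect_clones(blocks, threshold=0.8):
    """Report pairs of block ids whose similarity reaches the threshold.

    `blocks` maps a block id to its source text. An inverted index from
    token to block ids restricts comparisons to blocks sharing a token.
    """
    bags = {bid: bag_of_tokens(code) for bid, code in blocks.items()}

    index = defaultdict(set)
    for bid, bag in bags.items():
        for token in bag:
            index[token].add(bid)

    clones = set()
    for bid, bag in bags.items():
        candidates = set()
        for token in bag:
            candidates |= index[token]
        for other in candidates:
            # `other > bid` compares each unordered pair exactly once.
            if other > bid and overlap_similarity(bag, bags[other]) >= threshold:
                clones.add((bid, other))
    return sorted(clones)


if __name__ == "__main__":
    blocks = {
        "a": "int sum(int a, int b) { return a + b; }",
        "b": "int sum(int x, int b) { return x + b; }",  # renamed parameter
        "c": "print('hello')",
    }
    print(detect_clones(blocks))
```

Because token order is ignored, a copy with a renamed parameter still shares most of its tokens with the original, which is why this style of comparison tolerates near-miss edits. In this sketch every block sharing at least one token becomes a candidate; shrinking that candidate set is the role the text assigns to the filtering heuristics.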
This dissertation evaluates the scalability, execution time, and accuracy of SourcererCC against four state-of-the-art open-source tools: CCFinderX, Deckard, iClones, and NiCad. To measure scalability, the performance of the tools is evaluated on the inter-project software repository IJaDataset-2.0, which consists of 25,000 projects containing 3 million files and 250 MLOC. To measure precision and recall, two recent benchmarks are used: (1) a benchmark of real clones, BigCloneBench, that spans the four primary clone types and the full spectrum of syntactical similarity in three different languages (Java, C, and C#); and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. The results of these experiments suggest that SourcererCC advances the state of the art in code clone detection by being the most scalable technique known so far, with accuracy on par with the current state-of-the-art tools.
Additionally, this dissertation presents two tools built on top of SourcererCC: (i) SourcererCC-D, a distributed version of SourcererCC that exploits the inherent parallelism in SourcererCC's approach to scale horizontally on a cluster of commodity machines for large-scale code clone detection, and that achieves ideal speed-up and near-linear scale-up on large datasets in our experiments; and (ii) SourcererCC-I, an interactive, real-time version of SourcererCC that is integrated with the Eclipse development environment and built to support developers in clone-aware development and maintenance activities. Finally, this dissertation concludes by presenting two empirical studies conducted using SourcererCC to demonstrate its effectiveness in practice.