Open Access Publications from the University of California

UC Irvine

UC Irvine Electronic Theses and Dissertations

Towards Accurate and Scalable Clone Detection using Software Metrics

Creative Commons 'BY' version 4.0 license

Code clone detection tools find exact or similar pieces of code, known as code clones. Code clones are categorized into four types of increasing difficulty of detection, ranging from purely textual (Type I) to purely semantic (Type IV). Most clone detectors reported in the literature work well up to Type III, which accounts for syntactic differences. Between Type III and Type IV, however, lies a spectrum of clones that, although still exhibiting some syntactic similarities, are extremely hard to detect: the Twilight Zone. Besides correctness, scalability has become a must-have requirement for modern clone detection tools. The growing amount of source code in web-hosted open source repository services has presented opportunities to improve the state of the art in various modern use cases of clone detection, such as detecting similar mobile applications, license violation detection, mining library candidates, code repair, and code search, among others. Though these opportunities are exciting, scaling clone detection to such vast corpora poses a critical challenge.

Over the years, many clone detection techniques and tools have been developed. One class of these techniques is based on software metrics. Metrics-based clone detection has the potential to identify clones in the Twilight Zone. For various reasons, however, metrics-based techniques are hard to scale to large datasets. My work highlights issues that prevent metrics-based clone detection techniques from scaling to large datasets while maintaining high levels of correctness. The identification of these issues allowed me to rethink how metrics could be used for clone detection.

This dissertation starts by presenting an empirical study that uses software metrics to understand whether metrics can identify differences between cloned and non-cloned code. The study is followed by another large-scale study that explores the extent of cloning in GitHub. Here, the dissertation highlights scalability challenges in clone detection and how they were addressed. These two studies provided a strong base for using software metrics for clone detection in a scalable manner. To this end, the dissertation presents Oreo, a novel approach capable of detecting harder-to-detect clones in the Twilight Zone. Oreo is built using a combination of machine learning, information retrieval, and software metrics. This dissertation evaluates the recall of Oreo on BigCloneBench, a benchmark of real-world code clones. In experiments comparing the detection performance of Oreo with that of five other state-of-the-art clone detectors, we found that Oreo has both high recall and high precision. More importantly, it pushes the boundary in the detection of clones with moderate to weak syntactic similarity, in a scalable manner. Further, to address the issues identified in precision evaluations, the dissertation presents InspectorClone, a semi-automated approach to facilitate precision studies of clone detection tools. InspectorClone makes use of some of the concepts introduced in the design of Oreo to automatically resolve different types of clone pairs. Experiments demonstrate that InspectorClone has very high precision and significantly reduces the number of clone pairs that need human validation during precision experiments. Moreover, InspectorClone aggregates the individual effort of multiple teams into a single evolving dataset of labeled clone pairs, creating an important asset for software clone research.
Finally, the dissertation concludes with a discussion of the lessons learned during the design and development of Oreo and lists a few areas for future work in code clone detection.
