This thesis presents an experimental study of parameters affecting the tokenization process of an existing code clone detection tool, SourcererCC. SourcererCC is a token-based clone detector that targets three clone types and exploits an index to achieve scalability to large inter-project repositories on a standard workstation. I experiment with three parameters affecting tokenization: (1) threshold, (2) stop words, and (3) use of sub-tokens. I evaluate the performance of these three parameters against the original SourcererCC, using precision and recall as the evaluation metrics. I also create a web interface for SourcererCC and use it to conduct preliminary experiments on the parameters, which yield the best results at an 80 percent threshold.
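To make the two token-level options concrete, the following minimal sketch (not taken from SourcererCC; the keyword list and splitting rules are illustrative assumptions) shows how enabling sub-tokens and stop-word removal could change the bag of tokens extracted from a small Java method.

```python
# Illustrative sketch only: shows how sub-token splitting and stop-word
# removal change a code fragment's token bag. Rules are assumptions, not
# SourcererCC's actual implementation.
import re
from collections import Counter

# Assumed stop-word list (a small subset of Java keywords).
JAVA_STOP_WORDS = {"public", "static", "void", "int", "return", "if", "else", "for", "new"}

def tokenize(code, use_sub_tokens=False, remove_stop_words=False):
    # Extract identifier-like tokens from the source text.
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in JAVA_STOP_WORDS]
    if use_sub_tokens:
        # Split camelCase and snake_case identifiers into their parts.
        tokens = [p.lower()
                  for t in tokens
                  for p in re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", t) if p]
    # Token-based clone detection compares bags (multisets) of tokens.
    return Counter(tokens)

snippet = "public int getMaxValue(int[] data) { return computeMax(data); }"
print(tokenize(snippet))
print(tokenize(snippet, use_sub_tokens=True, remove_stop_words=True))
```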
The experiments conducted for the evaluation include (1) the original SourcererCC, (2) SourcererCC with sub-tokens enabled, (3) SourcererCC with stop words, and (4) SourcererCC with both sub-tokens and stop words. The experiments are evaluated with a recall study using the BigCloneEval tool and with manual verification of precision. For the manual verification, 150 samples are selected from each experiment and reviewed by four judges to reduce bias. Finally, I analyze the results achieved and discuss the scope of future work for this thesis.