Data Augmentation Policies for Cancer Classification
- Author(s): Vyas, Aditya
- Advisor(s): Brody, James
- et al.
Data augmentation is a well-established technique for improving the performance of modern classifiers. However, current data augmentation implementations mostly target image data. In this dissertation, two data augmentation policies are developed to augment limited non-image data. The policies are modeled after the errors found in scientific measurements and aim to enlarge a limited dataset by inducing those errors. The second purpose of this dissertation is to study the role of Copy Number Variations (CNVs) in cancer manifestation. CNVs are structural variations in the genome that may help identify individuals likely to develop particular cancers. The germline data is collected from The Cancer Genome Atlas Program (TCGA), and contemporary machine learning models are used to classify cancer types. However, these classification models perform poorly because of the limited availability of data; therefore, the data augmentation policies are used to increase data diversity. Applying these policies improved classifier performance by ≈3-4%, a gain that can be highly beneficial for future research. These policies can also be used to find dominant chromosomal regions that are correlated with the respective cancer types, thereby giving the medical community more insight for further analysis.
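To make the idea of augmenting by induced measurement error concrete, the following is a minimal sketch and not the dissertation's actual policies, which the abstract does not specify. It assumes zero-mean Gaussian noise scaled to each feature's standard deviation; the function name `augment_with_measurement_noise` and the parameters `n_copies` and `rel_sigma` are hypothetical and chosen only for illustration.

```python
import numpy as np


def augment_with_measurement_noise(X, y, n_copies=2, rel_sigma=0.05, seed=0):
    """Return the original samples plus n_copies noisy copies of each.

    Assumption (not from the source): measurement error is modeled as
    zero-mean Gaussian noise whose standard deviation is rel_sigma times
    the per-feature standard deviation of X.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Per-feature scale, so features with larger spread receive larger noise.
    feature_scale = X.std(axis=0)

    noisy_copies = [
        X + rng.normal(0.0, 1.0, size=X.shape) * rel_sigma * feature_scale
        for _ in range(n_copies)
    ]

    # Labels are unchanged: a perturbed sample keeps the class of its source.
    X_aug = np.vstack([X, *noisy_copies])
    y_aug = np.concatenate([y] * (n_copies + 1))
    return X_aug, y_aug


if __name__ == "__main__":
    # Toy stand-in for a small table of per-region copy-number features.
    X = np.array([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
    y = np.array([0, 1, 0])
    X_aug, y_aug = augment_with_measurement_noise(X, y)
    print(X_aug.shape, y_aug.shape)
```

In a setting like the one described, such augmentation would normally be applied to the training split only, so that held-out evaluation data stays free of synthetic samples.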