Breast Cancer Prediction from Genome Segments with Machine Learning
- Author(s): Tong, Xinhan
- Advisor(s): Brody, James
- et al.
Breast cancer is the most commonly diagnosed cancer among women worldwide. Because its clinical behavior is highly heterogeneous, it is difficult to predict and diagnose from clinical information alone. To improve prediction at an early stage, we turn to genome-wide analysis. In this study, we obtained a dataset from The Cancer Genome Atlas (TCGA) database and searched for the best predictive machine learning model. Since copy number variations (CNVs) are strongly associated with breast cancer, the CNV is used as the basic indicator for each genome segment. Based on segment start and end positions, the data can be sorted and reorganized into five grouping sets. We tested the predictive power of the Gradient Boosting Machine, Distributed Random Forest, XGBoost, and a Deep Learning Neural Network. Across the different genome segment groupings and machine learning models, the Gradient Boosting Machine proved the most powerful model for this problem, reaching an AUC of 0.756799 under 15-fold cross-validation when trained on the “merged” grouping dataset.
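As a rough illustration of the evaluation described above, the following minimal sketch estimates a 15-fold cross-validated AUC for a gradient boosting classifier. It rests on several assumptions that do not come from the study: scikit-learn's `GradientBoostingClassifier` stands in for whatever Gradient Boosting Machine implementation was actually used, the hyperparameters are arbitrary, the helper name `cross_validated_auc` is hypothetical, and the feature matrix is synthetic random data standing in for per-segment CNV values, not TCGA data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cross_validated_auc(X, y, n_folds=15, seed=0):
    """Return (mean AUC, per-fold AUCs) for a gradient boosting classifier.

    Hypothetical helper: the model settings are illustrative assumptions,
    not the configuration used in the study.
    """
    model = GradientBoostingClassifier(n_estimators=50, random_state=seed)
    # Stratified folds keep both classes present in every held-out fold,
    # which is required for AUC to be defined on each fold.
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_aucs = cross_val_score(model, X, y, cv=folds, scoring="roc_auc")
    return fold_aucs.mean(), fold_aucs


# Synthetic stand-in data: rows are samples, columns are genome segments,
# and values play the role of per-segment CNV measurements (assumption).
rng = np.random.default_rng(0)
n_samples, n_segments = 300, 20
X = rng.normal(size=(n_samples, n_segments))

# The label depends weakly on the first two segments, plus noise.
signal = 0.8 * X[:, 0] - 0.6 * X[:, 1]
y = (signal + rng.normal(size=n_samples) > 0).astype(int)

mean_auc, fold_aucs = cross_validated_auc(X, y)
print(f"mean AUC over {len(fold_aucs)} folds: {mean_auc:.3f}")
```

The number printed here reflects only the synthetic data and says nothing about the reported 0.756799. Comparing models or grouping sets in this setup would amount to calling the same helper with a different estimator or a different feature matrix and comparing the mean AUCs.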