A gradient boosting machine algorithm to predict age of glioblastoma incidence with copy number variation data
Glioblastoma multiforme (GBM) is the most common form of brain cancer. The exact cause of GBM is not well understood. In this thesis, we tested whether germline genetic information could predict who will develop GBM and when they will develop it. We first extracted copy number variation (CNV) data from germline DNA in the peripheral blood samples of 8826 patients in The Cancer Genome Atlas (TCGA) database. We compared these data with data from 8338 patients in the database who did not develop GBM. We used several machine learning algorithms (deep learning, gradient boosting machine, and random forest) to test whether the germline genetic data could predict who would develop GBM. The gradient boosting machine algorithm achieved the best results, with an area under the curve (AUC) of 0.82. We then used this gradient boosting method to test whether germline DNA information could predict the age at diagnosis of GBM patients. We compared the correlation coefficient between predicted and actual age for GBM patients with the correlation coefficients measured for randomized control groups and found significantly better prediction in the GBM patient group (p-value 0.0004). These results suggest that both who develops glioblastoma and when they are diagnosed are influenced by germline genetics.
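
The comparison against randomized control groups can be illustrated with a permutation test on the correlation between predicted and actual age. The following is a minimal sketch, not the procedure used in the thesis: it assumes the randomized controls are formed by shuffling actual ages against the predictions, the data are synthetic stand-ins, and the names `pearson_r` and `permutation_p_value` are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(predicted, actual, n_permutations=2000, seed=0):
    """One-sided p-value: fraction of shuffled pairings that correlate
    at least as strongly as the real pairing (assumed randomization scheme)."""
    rng = random.Random(seed)
    observed = pearson_r(predicted, actual)
    shuffled = list(actual)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between prediction and true age
        if pearson_r(predicted, shuffled) >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic stand-in data (NOT from the study): true ages and noisy predictions.
data_rng = random.Random(42)
actual_age = [data_rng.uniform(30, 80) for _ in range(200)]
predicted_age = [0.5 * a + 30 + data_rng.gauss(0, 8) for a in actual_age]

r, p = permutation_p_value(predicted_age, actual_age)
print(f"observed r = {r:.3f}, permutation p = {p:.4f}")
```

In a real analysis, `predicted_age` would come from held-out predictions of the gradient boosting model, and the number of permutations bounds the smallest p-value that can be reported.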