Chromosomal scale length variations as a genetic risk score for predicting complex human diseases in large scale genomic datasets
Next generation sequencing has created large databases of human genomic information. Utilizing this information to understand disease and genetic risk is a large engineering task. Previous studies have focused primarily on single nucleotide polymorphisms (SNPs) in assessing patient risk for cancers and for other diseases such as schizophrenia. These SNP panels do not consider epistatic interactions in the human genome. Chromosomal scale-length variation (CSLV) is a promising approach for assessing genetic risk scores. CSLV evaluates copy number variations (CNVs), condensing genomic information into a smaller number of parameters. Reducing the number of parameters allows the use of machine learning without the need for millions of patients’ data. Machine learning can consider epistatic interactions that might be missed by conventional genome-wide association studies (GWAS). Using machine learning classification algorithms, we assessed the prediction of diseases such as ovarian cancer and schizophrenia with CSLVs as the sole features. We have demonstrated the viability of this method in assessing germline inheritance of complex human diseases in The Cancer Genome Atlas (TCGA) and the UK Biobank. We tested 33 different types of cancer from TCGA’s 11,000 patients. Glioblastoma multiforme (AUC = 0.87), ovarian cancer (AUC = 0.89), colon adenocarcinoma (AUC = 0.82), and breast invasive carcinoma (AUC = 0.75) could be distinguished from other cancers at better than chance levels. We replicated these results in the UK Biobank: using 88 numbers computed from the 22 autosomes for 1,534 women with breast cancer and a control population of 4,391 women without breast cancer, we found a classifier with an AUC of 0.83. In the UK Biobank, 1,129 people have a diagnosis of schizophrenia. Using a randomly selected set of 1,129 individuals without schizophrenia as controls, we created 150 models using 92 CSLV numbers as our feature set. These models gave an average AUC of 0.545 (95% CI 0.539-0.550).
Our results indicate that CSLV data can provide an effective genetic risk score for schizophrenia. In conclusion, CSLV is a promising and novel way to utilize large-scale human genetic information in the prediction of complex diseases. Continued improvement of this technique could substantially improve individualized patient care and aid physicians in earlier diagnosis.
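
The repeated-model evaluation described for schizophrenia (equal numbers of cases and randomly drawn controls, many models, an average AUC with a 95% confidence interval) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the data are synthetic, the classifier is a simple difference-of-class-means projection standing in for the unspecified machine learning algorithm, the interval is a normal approximation on the mean AUC, and the names `auc_score` and `mean_auc_with_ci` are hypothetical.

```python
import numpy as np


def auc_score(y_true, scores):
    """Rank-based AUC (Mann-Whitney U statistic); tied scores share an average rank."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores), dtype=float)
    i, n = 0, len(scores)
    while i < n:
        j = i
        while j + 1 < n and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank of the tie group
        i = j + 1
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def mean_auc_with_ci(cases, control_pool, n_models=150, test_frac=0.2, seed=0):
    """Average held-out AUC over repeated balanced case/control models.

    Each repetition draws as many controls as there are cases, makes a
    stratified train/test split, fits the stand-in classifier, and scores
    the held-out samples. The 95% interval is mean +/- 1.96 standard errors
    (an assumption; the source does not say how its interval was computed).
    """
    rng = np.random.default_rng(seed)
    n_cases = len(cases)
    n_test = max(1, int(test_frac * n_cases))  # held-out samples per class
    aucs = []
    for _ in range(n_models):
        picked = rng.choice(len(control_pool), size=n_cases, replace=False)
        controls = control_pool[picked]
        case_perm = rng.permutation(n_cases)
        ctrl_perm = rng.permutation(n_cases)
        train_cases = cases[case_perm[n_test:]]
        train_ctrls = controls[ctrl_perm[n_test:]]
        # Stand-in classifier: project onto the difference of class means.
        w = train_cases.mean(axis=0) - train_ctrls.mean(axis=0)
        x_test = np.vstack([cases[case_perm[:n_test]], controls[ctrl_perm[:n_test]]])
        y_test = np.concatenate([np.ones(n_test, dtype=int), np.zeros(n_test, dtype=int)])
        aucs.append(auc_score(y_test, x_test @ w))
    aucs = np.asarray(aucs)
    mean = float(aucs.mean())
    half_width = 1.96 * float(aucs.std(ddof=1)) / np.sqrt(len(aucs))
    return mean, (mean - half_width, mean + half_width)


# Synthetic demonstration only: 92 features per person, cases shifted slightly.
demo_rng = np.random.default_rng(1)
demo_cases = demo_rng.normal(0.3, 1.0, size=(200, 92))
demo_controls = demo_rng.normal(0.0, 1.0, size=(1000, 92))
demo_mean, demo_ci = mean_auc_with_ci(demo_cases, demo_controls)
print(f"mean AUC = {demo_mean:.3f}, 95% CI = ({demo_ci[0]:.3f}, {demo_ci[1]:.3f})")
```

Averaging over many models with freshly drawn controls reduces the dependence of the reported AUC on any single control sample, which matters when the signal is weak and an individual model's AUC sits close to 0.5.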