Despite diagnostic advances, developing reliable methods for assessing the risk of cancer occurrence remains a challenge. Effective risk assessment models can improve monitoring and increase the chance of early detection and intervention. Existing risk estimation models rely primarily on data collected from a single institution and often lack racial and ethnic diversity. In addition, many existing statistical models do not sufficiently incorporate hereditary factors. With recent advances in genetics, big data, and artificial intelligence, precision medicine can become a reality. In this study, we leveraged the data available from the largest cancer databases to develop machine learning models for predicting cancer occurrence.
In this work, we developed a novel framework for extracting recurrence cases from the SEER dataset and identified recurrence cases within 5-year and 10-year periods. Machine learning models for predicting recurrence of oral tongue squamous cell carcinoma (OTSCC) were then developed based on sociodemographic and clinical variables. Among the top trained classification models, the Gradient Boosting Machine performed best, achieving 81.8% accuracy and 97.7% precision for 5-year prediction. The 10-year predictions reached 80.0% accuracy and 94.0% precision.
In addition to this model, we explored a novel strategy that incorporates structural variations in germline DNA, specifically chromosomal-scale length variation (CSLV), to assess individuals' genetic risk scores. This approach enabled comprehensive analysis of copy number variations (CNVs) across large segments of the human genome, capturing variations that may contribute to inherited cancer risk. The strategy was tested on two datasets, UK Biobank and NIH All of Us. The viability of the approach was first evaluated by developing a machine learning model for predicting breast cancer recurrence from UK Biobank data. The model was developed using CSLV values of 489 patients, all of whom were white and had experienced breast cancer recurrence, together with a negative class of age-matched, under-sampled patients drawn from 13,478 cases without breast cancer recurrence. The model achieved an average AUC of 0.54 on an unseen split of the data, only slightly above the chance level of 0.5. Because the model was developed solely from CSLV values, it could not comprehensively evaluate an individual's risk of breast cancer recurrence.
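The age-matched under-sampling of the negative class described above can be sketched as follows. This is a minimal illustration that assumes exact one-to-one matching on age in whole years; the function name `age_matched_undersample`, the matching rule, and the toy ages are assumptions for illustration, not the procedure used in the study.

```python
import random
from collections import defaultdict


def age_matched_undersample(case_ages, control_ages, seed=0):
    """Pick at most one unused control of the same age for each case.

    Returns indices into control_ages. Exact-age, one-to-one matching
    is an assumed rule; cases with no remaining same-age control are
    left unmatched.
    """
    rng = random.Random(seed)

    # Group control indices by age, then shuffle each group so the
    # choice among same-age controls is random but reproducible.
    pool = defaultdict(list)
    for idx, age in enumerate(control_ages):
        pool[age].append(idx)
    for indices in pool.values():
        rng.shuffle(indices)

    matched = []
    for age in case_ages:
        if pool[age]:
            # pop() removes the control, so it cannot be reused.
            matched.append(pool[age].pop())
    return matched


# Toy ages (hypothetical, not study data).
cases = [50, 50, 61]
controls = [50, 50, 50, 61, 70, 61]
selected = age_matched_undersample(cases, controls)
```

Matching without replacement keeps the negative class the same size as the positive class whenever enough same-age controls exist, which is what makes the resulting training set balanced.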
To determine whether CSLV could be used to develop risk assessment models for cancer occurrence, we relied on the NIH All of Us dataset. The resulting risk estimation model evaluated individuals' risk of developing breast, colorectal, and oral cavity cancer solely from calculated CSLV values, with AUCs of 0.70, 0.68, and 0.69, respectively, on an unseen split of the data. By calculating the odds ratio relative to the whole population, we found that individuals scored by the model in the top 10% were 14, 12, and 13 times more likely, respectively, to develop the corresponding type of cancer. The diversity of the All of Us dataset allowed us to examine the model's performance in predicting an individual's risk of breast cancer across different races. This analysis provided insight into the generalizability of the model among racial groups.
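One way to read the top-10% odds ratio reported above is as the odds of cancer among the highest-scoring tenth of individuals divided by the odds of cancer in the whole population. The sketch below computes that quantity under this assumed definition; the function names, the rounding used to size the top group, and the toy inputs are illustrative and are not taken from the study.

```python
def odds(cases, total):
    """Odds of being a case: cases divided by non-cases."""
    non_cases = total - cases
    if non_cases == 0:
        raise ValueError("odds are undefined when there are no non-cases")
    return cases / non_cases


def top_fraction_odds_ratio(scores, labels, fraction=0.10):
    """Odds of cancer in the top-scoring fraction relative to everyone.

    scores: model risk scores, higher meaning higher predicted risk.
    labels: 1 if the individual developed cancer, otherwise 0.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")

    # Rank individuals from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(round(len(ranked) * fraction)))

    top_odds = odds(sum(label for _, label in ranked[:k]), k)
    population_odds = odds(sum(labels), len(labels))
    if population_odds == 0:
        raise ValueError("no cases in the population")
    return top_odds / population_odds


# Toy data (hypothetical): 20 individuals, 2 cases, one of them in the top 10%.
scores = [i / 20 for i in range(20)]
labels = [0] * 20
labels[19] = 1  # highest-scoring individual is a case
labels[0] = 1   # lowest-scoring individual is also a case
ratio = top_fraction_odds_ratio(scores, labels)
```

The comparison group here is the whole population, including the top group itself, which mirrors the phrase "relative to the whole population"; comparing against the remaining 90% instead would yield a larger ratio.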
In conclusion, advances in machine learning, next-generation sequencing, and big data have enabled the development of effective risk assessment models for various types of cancer. More importantly, the techniques introduced in this work are readily translatable to the study of other complex diseases. We hope that this investigation encourages future studies that incorporate clinical, sociodemographic, and genetic variables for the detection and treatment of cancer. As healthcare datasets continue to grow and computational power continues to increase, there is great promise for significant strides in precision medicine and personalized healthcare.