Colorectal cancer is the third most common type of cancer in the world. It begins with small, noncancerous clumps of cells (polyps). Early screening by colonoscopy, followed by surgical removal of polyps, can prevent the cancer from ever developing. However, colonoscopies are expensive and painful, and the lifetime risk of colon cancer is only 5%. A test that identifies people who are likely to develop colon cancer could eliminate needless colonoscopies.
We obtained germline genetic data for 1309 patients diagnosed with colon cancer and compared them with data for 7517 others who have never been diagnosed with colon cancer. This dataset was collected as part of The Cancer Genome Atlas Program. We used supervised machine learning on this dataset to answer two questions. First, what fraction of these colon cancer cases should be predictable from germline data? Second, how well could such a test perform? To answer these questions, we evaluated the performance of five different machine-learning algorithms: gradient boosting machine, wide neural networks, deep neural networks, dense-sparse neural networks, and pairwise neural networks.
We found that about 78% of colon cancer cases in the dataset should be predictable from germline genetic data. For a germline genetic test that could predict a future diagnosis of colon cancer, we measured the receiver operating characteristic (ROC) curve, which quantifies the tradeoff between sensitivity and specificity. The area under this curve was about 0.80. The gradient boosting machine and pairwise neural network algorithms performed equally well, and these two models were significantly better than the others. We conclude that a germline genetic test to predict a future diagnosis of colon cancer could be useful to focus screening on appropriate populations.
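To make the two performance measures concrete, the following is a minimal sketch of how the area under the ROC curve and a single sensitivity/specificity operating point can be computed from a model's risk scores. It is an illustration only and not the analysis code used in this study: the function names, the toy labels and scores, and the 0.5 threshold are all hypothetical. The area is computed with the rank formulation, i.e. the probability that a randomly chosen case receives a higher score than a randomly chosen control.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (case, control) pairs in which the case has the higher score.
    Ties count as half a win."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when every score >= threshold
    is called positive."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 1 = diagnosed with colon cancer, 0 = never diagnosed.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
print(f"AUC = {auc:.3f}, sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Lowering the threshold raises sensitivity at the cost of specificity; sweeping it over all score values traces out the full ROC curve, and the area under that curve summarizes the tradeoff in a single number.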