Multi-feature ensemble learning on cell-free DNA for accurately detecting and locating cancer
Early cancer detection and localization using cell-free DNA (cfDNA) faces multiple challenges, including the low fraction of tumor DNA in cfDNA and the molecular heterogeneity of cancer. Many features have been used to detect cancer in cfDNA, such as fragment length profiles, copy number changes, and microbial composition, but methylation in particular has been shown to enable early detection. Additionally, the tissue specificity of methylation has aided noninvasive cancer typing efforts. Typically, cfDNA methylation profiling is done through whole-genome bisulfite sequencing (WGBS) or targeted approaches, but these protocols either incur high cost or require prior knowledge of informative regions. Another procedure, reduced representation bisulfite sequencing (RRBS), strikes a balance between these two extremes, but is applicable only to intact genomic DNA, not to naturally fragmented cfDNA. Herein, we develop an integrated cancer detection and typing system, CancerRadar, that addresses these challenges. First, we present a novel protocol, cell-free Methylation Sequencing (cfMethylSeq), which adapts RRBS to cfDNA. We show that cfMethylSeq yields more than 12-fold enrichment over WGBS in CpG islands while reliably and reproducibly quantifying methylation and capturing broad, genome-wide signals. Next, we develop a computational platform to extract information from cfMethylSeq data and diagnose the patient. The platform derives cfDNA methylation, cfDNA fragment sizes, copy number changes, and microbial composition from the raw cfMethylSeq data and integrates them through multi-feature ensemble learning.
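To make the idea of multi-feature ensemble learning concrete, the sketch below combines per-feature cancer scores by weighted averaging (soft voting). This is an illustrative assumption, not the ensemble method used by CancerRadar, which the abstract does not specify. The function name `ensemble_score`, the feature keys, and the example scores are all hypothetical.

```python
from typing import Dict, Optional


def ensemble_score(
    feature_scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-feature cancer probabilities into one score.

    Assumption: each feature type (e.g. methylation, fragment size,
    copy number, microbial composition) has its own classifier that
    outputs a probability in [0, 1]. Here they are merged by a weighted
    average (soft voting); the real system may use a different scheme.
    """
    if not feature_scores:
        raise ValueError("at least one feature score is required")
    if weights is None:
        # Default: every feature contributes equally.
        weights = {name: 1.0 for name in feature_scores}
    total_weight = sum(weights[name] for name in feature_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        weights[name] * score for name, score in feature_scores.items()
    )
    return weighted_sum / total_weight


# Hypothetical scores for one sample, one per feature type.
sample = {
    "methylation": 0.90,
    "fragment_size": 0.70,
    "copy_number": 0.60,
    "microbial": 0.40,
}
combined = ensemble_score(sample)
```

In this sketch a sample with a weak signal in one feature can still be flagged when the other features agree, which is the general motivation for combining features.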
We demonstrate the power of CancerRadar in detecting and locating cancer in a cohort of 275 colon, liver, lung, and stomach cancer patients and 204 non-cancer individuals. For cancer detection, we achieved a sensitivity of 89.1% at 97% specificity in the independent validation set. For cancer typing, we achieved an accuracy of 91.5% in the independent validation set. We further show that integrating multiple features significantly increases detection power, especially for early-stage cancer. Our novel protocol and computational procedure have the potential to revolutionize cancer detection and methylation analyses in cfDNA, and the data generated will be a valuable resource for the cfDNA research community.
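A figure such as "sensitivity at 97% specificity" is obtained by fixing a score threshold on non-cancer samples and then counting the cancer samples that exceed it. The sketch below shows one common way to compute it; the function names and the toy scores are hypothetical, and the exact thresholding procedure used in the study is not stated in the abstract.

```python
import math
from typing import Sequence


def threshold_at_specificity(
    control_scores: Sequence[float], specificity: float
) -> float:
    """Return the lowest observed control score such that at least
    `specificity` of controls score at or below it."""
    ordered = sorted(control_scores)
    # Number of controls that must be called negative. The small epsilon
    # guards against floating-point error pushing ceil() up by one.
    k = max(1, math.ceil(specificity * len(ordered) - 1e-9))
    return ordered[k - 1]


def sensitivity_at_specificity(
    case_scores: Sequence[float],
    control_scores: Sequence[float],
    specificity: float = 0.97,
) -> float:
    """Fraction of cancer cases scoring strictly above the threshold
    that holds the controls at the requested specificity."""
    threshold = threshold_at_specificity(control_scores, specificity)
    detected = sum(1 for score in case_scores if score > threshold)
    return detected / len(case_scores)


# Toy data: 100 controls with scores 0.00 .. 0.99 and four cancer cases.
controls = [i / 100 for i in range(100)]
cases = [0.20, 0.50, 0.97, 0.99]
sens = sensitivity_at_specificity(cases, controls, specificity=0.97)
```

With these toy scores the threshold falls at 0.96, so three of the 100 controls are false positives and two of the four cases are detected.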