Robust Machine Learning Directed Toward Pathology Imaging Applications with Limited Labeled Data
- Lai, Zhengfeng
- Advisor(s): Chuah, Chen-Nee
Abstract
Neurodegenerative diseases are defined by a progressive loss of selectively vulnerable neuronal populations. Hence, augmenting a pathologist's or expert's ability to detect and localize specific pathologies with data-driven deep learning (DL) approaches could have a transformative impact. DL models trained on high-quality annotated datasets can provide a cost-effective and reliable means for deeper phenotyping of neurodegenerative and neurologic diseases. With digital slide scanners such as the Leica Aperio AT2 and Zeiss Axio Z1, physical tissue slides can be digitized into gigapixel whole slide images (WSIs) that contain rich information, including diverse morphological patterns, cellular features, and tissue architecture. However, there are severe challenges to fully realizing the digitalization and automation of the detection of these pathologies and neurodegenerative features. First, tissue sections can be scanned at ×20 or ×40 objective magnification, resulting in ultra-high-resolution WSIs that require a large amount of memory for storage and processing. To alleviate this issue, we design a general patch-based approach and use plaque quantification in grey and white matter as one example. In this case study, our automated framework can classify, localize, quantify, and visualize the distribution of each type of plaque in grey and white matter separately. We have tested it across different scanners, brain areas, and storage formats to benchmark its generalizability.
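To make the memory argument and the patch-based idea concrete, the following is a minimal sketch and not the thesis's pipeline. The slide dimensions, patch size, and the function names `iter_patch_boxes` and `uncompressed_rgb_bytes` are illustrative assumptions; a real pipeline would read each box from the slide file with a WSI reader instead of only enumerating coordinates.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def uncompressed_rgb_bytes(width: int, height: int) -> int:
    """Bytes needed to hold a full 8-bit RGB image in memory (3 bytes per pixel)."""
    return width * height * 3


def iter_patch_boxes(width: int, height: int,
                     patch: int = 256, stride: int = 256) -> Iterator[Box]:
    """Yield patch boxes that tile a width x height image.

    Patches at the right and bottom edges are clipped to the image bounds,
    so every pixel is covered when stride <= patch.
    """
    for top in range(0, height, stride):
        for left in range(0, width, stride):
            yield (left, top, min(left + patch, width), min(top + patch, height))


if __name__ == "__main__":
    # Hypothetical slide size, chosen only to illustrate the scale involved.
    slide_w, slide_h = 100_000, 80_000
    print("full image bytes:", uncompressed_rgb_bytes(slide_w, slide_h))
    print("one 256x256 patch bytes:", uncompressed_rgb_bytes(256, 256))
    # Count patches lazily; nothing close to the full image is ever materialized.
    print("patches:", sum(1 for _ in iter_patch_boxes(slide_w, slide_h)))
```

Processing one box at a time keeps peak memory proportional to the patch size rather than to the slide size, which is the property a patch-based approach relies on.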
Second, although the proposed supervised learning pipelines can achieve promising results, their performance relies heavily on the quality and quantity of the data and annotations. However, manually annotating comprehensive digital pathology datasets is time-consuming and labor-intensive, and it does not scale to diverse pathological tasks. Therefore, to relieve the heavy reliance on labeled data, we investigate semi-supervised learning (SSL) and propose four SSL frameworks to enhance the applicability and deployment of SSL algorithms. For example, we propose an imbalanced SSL method that makes no assumption about the distribution of the unlabeled data; we also improve pseudo-labeling and design a deep fusion of SSL and active learning to further reduce the labeling effort.
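For readers unfamiliar with pseudo-labeling, the sketch below shows the generic confidence-thresholded form of the idea. It is not the improved variant proposed in this work; the function name `pseudo_labels` and the 0.95 threshold are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def pseudo_labels(probs: Sequence[Sequence[float]],
                  threshold: float = 0.95) -> List[Optional[int]]:
    """Assign a pseudo-label to each unlabeled sample.

    probs[i] holds the model's predicted class probabilities for sample i.
    A sample receives its most likely class as a label only when that
    probability reaches the threshold; otherwise it gets None and is
    left out of the current training round.
    """
    labels: List[Optional[int]] = []
    for p in probs:
        best = max(range(len(p)), key=p.__getitem__)  # index of the top class
        labels.append(best if p[best] >= threshold else None)
    return labels


if __name__ == "__main__":
    # Made-up predictions for three unlabeled patches over three classes.
    predictions = [
        [0.97, 0.02, 0.01],  # confident: kept with label 0
        [0.50, 0.40, 0.10],  # uncertain: skipped
        [0.01, 0.01, 0.98],  # confident: kept with label 2
    ]
    print(pseudo_labels(predictions))
```

A fixed threshold tends to favor classes the model already predicts confidently, which is one reason class imbalance in the unlabeled data needs explicit handling.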
Third, besides the annotation cost, data collection can also be expensive and challenging because pathology data are scarce and pathology datasets are heterogeneous: they can vary significantly in image resolution, staining techniques, and scanning equipment. Therefore, to reduce the data collection effort, we study unsupervised domain adaptation (UDA) and efficient adaptation of large-scale pre-trained vision-language models (VLMs). UDA aims to reduce data annotation costs by transferring knowledge from a labeled source domain to an unlabeled target domain. We propose a UDA pipeline that bridges the domain gap and makes domain adaptation data- and label-efficient. VLMs, on the other hand, have shown promising transferability across different downstream tasks. However, traditional fine-tuning may involve tuning billions of parameters, which is computationally expensive and prone to overfitting when the downstream task has limited data. We propose a parameter-efficient pipeline to adapt VLMs to multiple downstream tasks in a way that scales in terms of processing time and computational cost. In conclusion, our proposed pipelines can potentially achieve data- and label-efficient learning for diverse pathological imaging tasks.