Does your dermatology classifier know what it doesn't know? Detecting the long-tail of unseen conditions.

Abhijit Guha Roy, Jie Ren, Shekoofeh Azizi, Aaron Loh, Vivek T. Natarajan, Basil Mustafa, Nick Pawlowski, Jan Freyberg, Yuan Liu, Zachary Beaver, Nam Vo, Peggy Bui, Samantha Winter, Patricia MacWilliams, Greg S. Corrado, Umesh Telang, Yun Liu, A. Taylan Cemgil, Alan Karthikesalingam, Balaji Lakshminarayanan, Jim Winkens

2022

Supervised deep learning models have proven highly effective at classifying dermatological conditions. These models rely on the availability of abundant labeled training examples. In the real world, however, many dermatological conditions are individually too infrequent for per-condition classification with supervised learning. Although individually infrequent, these conditions may collectively be common and are therefore clinically significant in aggregate. To prevent models from generating erroneous outputs on such examples, there remains a considerable unmet need for deep learning systems that can better detect these infrequent conditions. Such infrequent 'outlier' conditions are seen very rarely (or not at all) during training. In this paper, we frame this task as an out-of-distribution (OOD) detection problem. We set up a benchmark ensuring that outlier conditions are disjoint between the model training, validation, and test sets. Unlike traditional OOD detection benchmarks, where the task is to detect dataset distribution shift, we aim at the more challenging task of detecting subtle differences resulting from a different pathology or condition. We propose a novel hierarchical outlier detection (HOD) loss, which assigns multiple abstention classes, one for each training outlier class, and jointly performs a coarse classification of inliers vs. outliers along with a fine-grained classification of the individual classes. We demonstrate that the proposed HOD-loss-based approach outperforms leading methods that leverage outlier data during training. Performance is further significantly boosted by using recent representation learning methods (BiT, SimCLR, MICLe). In addition, we explore ensembling strategies for OOD detection and propose a diverse ensemble selection process for the best result.
We also perform a subgroup analysis over conditions of varying risk levels and over different skin types to investigate how OOD performance changes across subgroups, and we demonstrate the gains of our framework over the baseline. Furthermore, we go beyond traditional performance metrics and introduce a cost matrix for model trust analysis to approximate downstream clinical impact. We use this cost matrix to compare the proposed method against the baseline, thereby making a stronger case for its effectiveness in real-world scenarios.
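
The abstract describes the HOD loss only in words, so the following is a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes the classifier emits one logit per inlier class followed by one logit per training outlier (abstention) class, that the fine-grained term is ordinary cross-entropy over all classes, and that the coarse term is a binary cross-entropy on the summed inlier and summed outlier probabilities. The function names `hod_loss` and `ood_score`, the weight `lam`, and the use of the summed outlier probability as the OOD score are illustrative assumptions.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def hod_loss(logits, labels, num_inlier, lam=0.1, eps=1e-12):
    """Sketch of a hierarchical outlier detection loss (assumed form).

    logits:     (batch, num_inlier + num_outlier) array.
    labels:     (batch,) integer class ids; ids >= num_inlier are
                training outlier (abstention) classes.
    num_inlier: number of inlier classes (columns 0..num_inlier-1).
    lam:        hypothetical weight on the coarse term.
    """
    probs = softmax(logits)
    rows = np.arange(logits.shape[0])

    # Fine-grained term: cross-entropy over all individual classes.
    fine = -np.log(probs[rows, labels] + eps)

    # Coarse term: binary cross-entropy on inlier vs. outlier mass.
    p_in = probs[:, :num_inlier].sum(axis=1)
    p_out = probs[:, num_inlier:].sum(axis=1)
    is_outlier = labels >= num_inlier
    coarse = -np.where(is_outlier, np.log(p_out + eps), np.log(p_in + eps))

    return float((fine + lam * coarse).mean())


def ood_score(logits, num_inlier):
    """Assumed OOD score: total probability on the outlier classes."""
    return softmax(logits)[:, num_inlier:].sum(axis=1)
```

Under these assumptions, an example is flagged as out-of-distribution when `ood_score` exceeds a threshold chosen on a validation set, and setting `lam=0` reduces the loss to plain cross-entropy over the inlier and outlier classes.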
