Improving Uncertainty Estimation in Convolutional Neural Networks Using Inter-rater Agreement

2019 
Modern neural networks are pushing the boundaries of medical image classification. For some tasks in dermatology, state-of-the-art models can outperform human experts in terms of accuracy and type I/II errors. However, in medical applications, models should also be evaluated on how well they capture uncertainty in samples and labels. This aspect is key to building trust in computer-assisted systems, which are otherwise largely considered black boxes by their users. A common practice in supervised learning is to collect multiple evaluations per sample, which is particularly useful when inter-rater agreement is expected to be low. At the same time, model training traditionally uses label fusion, such as majority voting, to produce a single label for each sample. In this paper, we experimentally show that models trained to predict skin conditions become overconfident when this approach is used, i.e. the probability estimates of the model exceed the true correctness likelihood. Additionally, we show that a better-calibrated model is obtained when training with a label sampling scheme that takes advantage of inter-rater variability during training. The calibration improvements come at no cost in terms of model accuracy. Our approach is combined and contrasted with other recent techniques in uncertainty estimation. All experiments are evaluated on a proprietary dataset of 31,017 skin images, each diagnosed by up to 12 experts.
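
To make the contrast between label fusion and label sampling concrete, the sketch below compares majority-vote fusion with drawing a single rater's label at random for each training pass. This is a minimal illustration under assumed inputs (a per-image array of rater-assigned class indices); the function names and NumPy-based setup are not taken from the paper.

```python
import numpy as np

def majority_vote_label(rater_labels):
    """Fuse multiple rater labels into one hard label by majority vote."""
    values, counts = np.unique(rater_labels, return_counts=True)
    return values[np.argmax(counts)]

def sample_label(rater_labels, rng):
    """Draw one rater's label uniformly at random, so the training targets
    follow the empirical distribution of rater opinions."""
    return rng.choice(rater_labels)

# Example: one image annotated by five raters (class indices are hypothetical).
rater_labels = np.array([2, 2, 2, 5, 5])
rng = np.random.default_rng(0)

print(majority_vote_label(rater_labels))                     # always 2
print([sample_label(rater_labels, rng) for _ in range(5)])   # mixes 2s and 5s
```

Under the sampling scheme, repeated epochs expose the model to the disagreement among raters rather than to a single hard label, which is the mechanism the abstract credits for the improved calibration.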