Technical Note: An Embedding-based Medical Note De-identification Approach with Sparse Annotation.

2020 
Purpose Medical note de-identification is critical for protecting private information and for secure data sharing in collaborative research. The task demands the complete removal of all patient names and other sensitive information, such as addresses and phone numbers, from medical records. Accomplishing this goal is challenging because medical note formats and string representations vary widely. Existing de-identification approaches include pattern matching, where extensive dictionary lists are constructed a priori, and entity tagging, which trains on a large word-wise annotated corpus. Both carry a substantial up-front preparation burden, which motivates us to study an alternative with a reduced annotation burden.
Methods In this work, we propose a novel approach that implicitly accounts for the language territory of sensitive information. Specifically, our approach combines a contextualized word embedding module with a multilayer perceptron to simultaneously infer the similarity of sensitive and non-sensitive vocabularies to a constructed landmark set, yielding an overall sparsely supervised classification. To demonstrate the rationale, we present the principle and work pipeline on the task of name removal, but the proposed method applies to other types of strings as well.
Results On a large cohort of hybrid clinical reports, including various forms of consulting, on-treatment-visit, and follow-up notes, we achieved accuracies above 0.99 on our constructed training, validation, and testing sets. On two randomly selected reports, the sensitivity and specificity were 1.0 and 0.9973, respectively, comparing favorably to the benchmark Stanford NER tagger, which achieved 0.8529 and 0.9969. Across six randomly selected reports, the F1 score was 0.889 ± 0.046 for the proposed method and 0.822 ± 0.103 for the Stanford NER, and the difference was significant under a one-sided t-test at alpha = 0.1.
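For reference, the sensitivity, specificity, and F1 figures reported above follow the standard confusion-matrix definitions. The sketch below is a minimal illustration only: the labels are made-up toy data rather than the study's, and the function names are hypothetical.

```python
# Token-level evaluation metrics, with 1 = sensitive token (e.g. a name)
# and 0 = non-sensitive token. All data below is illustrative toy data.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif not truth and pred:
            fp += 1
        elif truth and not pred:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Fraction of sensitive tokens that were flagged: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    """Fraction of non-sensitive tokens that were kept: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example: 3 sensitive tokens, 5 non-sensitive tokens.
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("F1:", f1_score(tp, fp, fn))
```

In de-identification, sensitivity is usually the metric to watch most closely, since a missed name (a false negative) leaks private information, whereas a false positive only removes a harmless word.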
Conclusion Our qualitative and quantitative analysis shows that our method achieved better results than the pre-trained 3-class Stanford NER toolbox.
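The landmark idea described in the Methods can be illustrated with a deliberately simplified sketch. It is not the authors' pipeline: the hand-written 3-dimensional vectors stand in for contextualized word embeddings, and a nearest-landmark cosine-similarity rule is swapped in for the trained multilayer perceptron. All names and values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def landmark_features(embedding, landmarks):
    """Similarity of one token embedding to every landmark embedding."""
    return [cosine(embedding, lm) for lm in landmarks]

def looks_sensitive(embedding, name_landmarks, other_landmarks):
    """Assumed stand-in for the MLP: flag the token if its closest
    landmark belongs to the sensitive (name) group."""
    name_sim = max(landmark_features(embedding, name_landmarks))
    other_sim = max(landmark_features(embedding, other_landmarks))
    return name_sim > other_sim

# Toy landmark set: a few sparsely annotated examples per group.
name_landmarks = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
other_landmarks = [[0.0, 0.9, 0.3], [0.1, 0.2, 0.9]]

# Toy token embeddings (hypothetical values).
name_like_token = [0.85, 0.15, 0.05]
word_like_token = [0.05, 0.80, 0.40]

print(looks_sensitive(name_like_token, name_landmarks, other_landmarks))
print(looks_sensitive(word_like_token, name_landmarks, other_landmarks))
```

The point of the sketch is the annotation economy: only the small landmark set needs labels, and every other token is classified by where it falls relative to those landmarks in embedding space.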