Automatic de-identification of electronic health records : an Australian perspective

2013 
We present an approach to automatically de-identify health records. In our approach, personal health information is identified using a Conditional Random Fields machine learning classifier, a large set of linguistic and lexical features, and pattern matching techniques. Identified personal information is then removed from the reports. The de-identification of personal health information is fundamental for the sharing and secondary use of electronic health records, for example for data mining and disease monitoring. The effectiveness of our approach is first evaluated on the 2007 i2b2 Shared Task dataset, a widely adopted dataset for evaluating de-identification techniques. Subsequently, we investigate the robustness of the approach to limited training data; we study its effectiveness on different type and quality of data by evaluating the approach on scanned pathology reports from an Australian institution. This data contains optical character recognition errors, as well as linguistic conventions that differ from those contained in the i2b2 dataset, for example different date formats. The findings suggest that our approach compares to the best approach from the 2007 i2b2 Shared Task; in addition, the approach is found to be robust to variations of training size, data type and quality in presence of sufficient training data.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    9
    References
    3
    Citations
    NaN
    KQI
    []