Structuring clinical text with AI: old vs. new natural language processing techniques evaluated on eight common cardiovascular diseases

2021 
Mining the structured data in electronic health records(EHRs) enables many clinical applications while the information in free-text clinical notes often remains untapped. Free-text notes are unstructured data harder to use in machine learning while structured diagnostic codes can be missing or even erroneous. To improve the quality of diagnostic codes, this work extracts structured diagnostic codes from the unstructured notes concerning cardiovascular diseases. Five old and new word embeddings were used to vectorize over 5 million progress notes from Stanford EHR and logistic regression was used to predict eight ICD-10 codes of common cardiovascular diseases. The models were interpreted by the important words in predictions and analyses of false positive cases. Trained on Stanford notes, the model transferability was tested in the prediction of corresponding ICD-9 codes of the MIMIC-III discharge summaries. The word embeddings and logistic regression showed good performance in the diagnostic code extraction with TF-IDF as the best word embedding model showing AU-ROC ranging from 0.9499 to 0.9915 and AUPRC ranging from 0.2956 to 0.8072. The models also showed transferability when tested on MIMIC-III data set with AUROC ranging from 0.7952 to 0.9790 and AUPRC ranging from 0.2353 to 0.8084. Model interpretability was showed by the important words with clinical meanings matching each disease. This study shows the feasibility to accurately extract structured diagnostic codes, impute missing codes and correct erroneous codes from free-text clinical notes with interpretable models for clinicians, which helps improve the data quality of diagnostic codes for information retrieval and downstream machine-learning applications.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    34
    References
    3
    Citations
    NaN
    KQI
    []