Topic Modeling Based on ICD Codes for Clinical Documents

2021 
We proposed two ICD-based topic modeling methods, ICD-1 and ICD-2, which generate topics from the International Classification of Diseases (ICD) codes assigned to documents. We applied both methods to the Pittsburgh EHR dataset and, for comparison, also ran LDA on the same dataset. We then tested the three topic models on document retrieval and sentence retrieval, using TF-IDF keyword matching as the baseline for both tasks. We evaluated the results in three ways: precision at ten (P@10), document ranking correlation, and sentence relevance determination (measured by precision, recall, and F-score), all based on review and annotation of the retrieved documents by two medical experts. In the P@10 evaluation, the ICD-2 method achieved the highest average P@10, 0.61. In document ranking correlation, the ICD-1 method achieved the highest Pearson’s correlation coefficient, 0.709. In sentence relevance determination, the ICD-1 method achieved the highest F-score, 0.655. Overall, the ICD-based methods outperformed LDA and TF-IDF in these experiments.
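The three evaluation measures named above can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions used in the study, so the code assumes the standard textbook forms of P@10, Pearson's correlation coefficient, and precision/recall/F-score. The function names and the toy inputs in the usage note are hypothetical and are not taken from the paper or the Pittsburgh EHR dataset.

```python
from math import sqrt


def precision_at_k(relevance, k=10):
    """Fraction of the top-k retrieved items judged relevant.

    `relevance` is a ranked list of 0/1 expert judgments (assumed format).
    Dividing by k even when fewer than k items exist is one common convention.
    """
    return sum(relevance[:k]) / k


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def precision_recall_f1(predicted, gold):
    """Precision, recall, and F-score for a set of items predicted relevant
    against a set of items the annotators marked relevant."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

As a usage example with made-up judgments, `precision_at_k([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])` scores a ranking in which six of the top ten documents were judged relevant, and `precision_recall_f1({1, 2, 3}, {2, 3, 4, 5})` compares three sentences flagged by a model against four marked by annotators.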