Building a Pediatric Medical Corpus: Word Segmentation and Named Entity Annotation

2020 
Word segmentation and named entity annotation are essential foundations for extracting information from medical text. This paper focuses on clinical pediatric diseases and takes existing medical named entity and entity-relationship labeling systems as references. Guided by Chinese word segmentation and named entity labeling, it constructs annotation specifications for pediatric medical texts. A self-developed distributed annotation platform is used to pre-annotate the named entities, which are then manually proofread over multiple rounds. The resulting corpus consists of 38,805 medical entries in nine categories, including 504 entries for common pediatric diseases, 7,085 for body parts, 12,907 for clinical manifestations, and 4,354 for medical procedures. This paper constructs the largest corpus with pediatric medical word segmentation and named entity annotation, providing a data basis for related research.