A pre-training technique to localize medical BERT and enhance BioBERT.

2020 
Bidirectional Encoder Representations from Transformers (BERT) models for biomedical specialties, such as BioBERT and clinicalBERT, have substantially improved performance on biomedical text-mining tasks and enable valuable information to be extracted from the biomedical literature. However, these benefits have so far been limited to English, because high-quality medical corpora, such as PubMed, are scarce in other languages. We therefore propose a method that yields a high-performance BERT model from a small corpus. We use the method to train BERT models on small medical corpora in English and in Japanese, and evaluate them on the Biomedical Language Understanding Evaluation (BLUE) benchmark and on a Japanese medical-document-classification task, respectively. After confirming their satisfactory performance, we apply our method to develop a model that outperforms the pre-existing models. Bidirectional Encoder Representations from Transformers for Biomedical Text Mining by Osaka University (ouBioBERT) achieves the best scores on 7 of the 10 BLUE datasets; its total score is 1.0 points above that of BioBERT.
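The pre-training the abstract refers to is BERT's masked-language-model objective applied to a domain corpus. The following is a minimal sketch of one such training step, not the paper's actual code: the tiny model dimensions, the vocabulary size, and the stand-in token IDs are all illustrative assumptions, using the Hugging Face Transformers API.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

torch.manual_seed(0)

# Deliberately tiny BERT configuration (hypothetical sizes, for illustration only).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForMaskedLM(config)

# A fake tokenized batch standing in for sentences from a small medical corpus.
input_ids = torch.randint(5, 100, (4, 16))
labels = input_ids.clone()

# Mask ~15% of tokens, as in BERT pre-training; id 4 stands in for [MASK].
mask = torch.rand(input_ids.shape) < 0.15
labels[~mask] = -100          # positions set to -100 are ignored by the loss
input_ids[mask] = 4

# One optimization step on the masked-LM cross-entropy loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()
optimizer.step()
print(outputs.loss.item())    # cross-entropy over the masked positions
```

In practice the corpus would be tokenized with a domain-adapted WordPiece vocabulary and trained for many steps; this sketch only shows the shape of the objective.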