Character segmentation and transcription system for historical Japanese books with a self-proliferating character image database

2017 
This paper describes an interactive system that assists the transcription of digitized historical woodblock-printed Japanese books published from the seventeenth to the nineteenth centuries. The main functions of the system are layout analysis, character segmentation, transcription, and the generation of a character image database. Using the system involves two major phases. In the first phase, the system automatically produces provisional character segmentation data, and users interactively edit the segmentation results and transcribe them into text data. Information obtained in this phase is stored in the character image database. In the second phase, the system performs automatic character segmentation and transcription using the database generated in the first phase. As these two phases are applied repeatedly to a variety of materials, the contents of the character image database are enriched, and the system's performance in character segmentation and transcription improves accordingly. Because this scheme resembles parents producing children, who in turn produce grandchildren, and so on, the database is called a self-proliferating database. The experiment showed that transcription accuracy increased as the number of character images in the database increased. When the database grew to 37,000 character images, segmentation accuracy reached 83.7% and transcription accuracy reached 69.1%.
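The two-phase loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: the class `CharacterImageDatabase`, the functions `phase_one` and `phase_two`, the use of small numeric feature vectors in place of character images, and the nearest-neighbour matching are all assumptions made for illustration only. The sketch shows only how entries confirmed in the first phase become the reference data for automatic transcription in the second phase.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class CharacterImageDatabase:
    """Hypothetical stand-in for the character image database."""
    # Each entry pairs a feature vector (assumed representation of a
    # character image) with its transcribed label.
    entries: list = field(default_factory=list)

    def add(self, features, label):
        self.entries.append((tuple(features), label))

    def transcribe(self, features):
        """Return the label of the nearest stored entry, or None if empty.

        Nearest-neighbour matching is an assumption of this sketch,
        not a method stated in the source.
        """
        if not self.entries:
            return None
        nearest = min(self.entries, key=lambda entry: dist(entry[0], features))
        return nearest[1]


def phase_one(db, segments, user_labels):
    """Phase 1: store segments that a user has edited and transcribed."""
    for features, label in zip(segments, user_labels):
        db.add(features, label)


def phase_two(db, segments):
    """Phase 2: transcribe new segments automatically from the database."""
    return [db.transcribe(features) for features in segments]


db = CharacterImageDatabase()
phase_one(db, [(0.0, 0.0), (1.0, 1.0)], ["の", "し"])
result = phase_two(db, [(0.1, 0.2), (0.9, 0.8)])
print(result)
```

In this toy setting, results from the second phase that a user confirms could be passed back through `phase_one`, which is the sense in which the database grows over successive applications.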