Automatic Table-of-Contents Generation for Efficient Information Access

2020 
This paper presents a novel neural-based approach, applicable to any searchable PDF document that first detects the titles and then hierarchically orders them using a sequence labelling approach to generate automatically the Table of Contents (TOC). A TOC signals the main divisions and subdivisions of a document to assist with navigation and information localisation. Unlike previous methods, we do not assume the presence of parsable TOC pages in the document but infer the TOC from a data-driven analysis of sections titles, their order and their depth. We offer an exhaustive analysis of the proposed model and evaluate it on French and English using documents from the financial domain, which we release to increase community’s interest. We compare this model to state-of-the-art approaches and show its superiority in multiple experiments. The approach described in this paper can easily be adapted to other domains and documents and its application to the analysis of financial prospectuses will be strengthened by the release of datasets. The TOC generation algorithms used in this paper obtain state-of-the-art results and provide strong baselines for future work.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    43
    References
    0
    Citations
    NaN
    KQI
    []