Recurrent Neural Network Approach for Table Field Extraction in Business Documents

2019 
Efficiently extracting information from documents issued by their partners is crucial for companies that face a large daily flow of documents. In particular, tables contain the most valuable information in business documents, but this information is challenging to retrieve automatically because tables in an industrial context may have a complex and ambiguous physical structure. Bypassing table structure recognition, we propose a generic method for end-to-end table field extraction that starts from the list of document tokens segmented by an OCR engine and directly tags each token with one of the possible field types. Like state-of-the-art methods for non-tabular field extraction, our approach relies on a token-level recurrent neural network that combines spatial and textual features of tokens. Because it involves minimal domain-specific knowledge, the method could easily be adapted to information extraction from other document types. We empirically demonstrate the effectiveness of recurrent neural networks for our task by comparing our method with a baseline feedforward neural network that receives only limited context information as input. We train and evaluate the two approaches on a dataset of 28,570 real-world purchase orders, from which we wish to retrieve the ID numbers and quantities of the ordered products. On this dataset, our method outperforms the baseline, with micro F1 scores of 0.78 and 0.75 respectively, when extracting table fields from document layouts not seen during model training.
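As an illustration of the reported metric, the following is a minimal sketch of a micro-averaged F1 score computed over per-token field tags. The tag names (`ID`, `QTY`, `O` for tokens outside any field), the function name `micro_f1`, and the choice to score at token level are assumptions made for this example; the abstract does not specify how the scores were computed.

```python
def micro_f1(gold, pred, ignore="O"):
    """Micro-averaged F1 over token tags.

    Counts are pooled across all field types before computing
    precision and recall. The `ignore` tag (hypothetical "O" label
    for tokens outside any field) is not scored as a field.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore:
            if p == g:
                tp += 1  # predicted field matches the gold field
            else:
                fp += 1  # predicted a field that is wrong or absent
        if g != ignore and p != g:
            fn += 1      # gold field missed or mislabeled
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up tags, not data from the paper.
gold = ["O", "ID", "QTY", "O", "ID", "QTY", "O"]
pred = ["O", "ID", "QTY", "ID", "O", "O", "O"]
score = micro_f1(gold, pred)
```

Pooling the counts before averaging means that frequent field types weigh more heavily in the score than rare ones, which is the usual distinction between micro and macro averaging.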