Recurrent Neural Network Approach for Table Field Extraction in Business Documents

Clement Sage, Alexandre Aussem, Haytham Elghazel, Véronique Eglin, Jérémy Espinas

2019

Efficiently extracting information from documents issued by their partners is crucial for companies that face a huge document flow every day. In particular, tables contain the most valuable information in business documents, but this information is challenging to retrieve automatically because tables in industrial contexts may have complex and ambiguous physical structures. Bypassing table structure recognition, we propose a generic method for end-to-end table field extraction that starts from the list of document tokens segmented by an OCR engine and directly tags each of them with one of the possible field types. Like state-of-the-art methods for non-tabular field extraction, our approach relies on a token-level recurrent neural network that combines spatial and textual features of tokens. Because it involves minimal domain-specific knowledge, the method could easily be adapted to information extraction from other document types. We empirically demonstrate the effectiveness of recurrent neural networks for our task by comparing our method with a baseline feedforward neural network that receives only limited context information as input. We train and evaluate the two approaches on a dataset of 28,570 real-world purchase orders from which we wish to retrieve the ID numbers and quantities of the ordered products. On this dataset, our method outperforms the baseline, with micro F1 scores of 0.78 and 0.75, respectively, for extracting table fields from document layouts not seen during model training.
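To make the reported metric concrete, here is a minimal sketch of how a micro-averaged F1 score can be computed for token-level field tagging. The label names (`ID`, `QTY`, `O`), the function name `micro_f1`, and the choice to exclude the "no field" class from the counts are assumptions made for illustration; the abstract does not specify the exact evaluation protocol.

```python
from collections import Counter

# Hypothetical labels: product ID number, quantity, and "no field".
FIELD_TYPES = ("ID", "QTY")
OUTSIDE = "O"


def micro_f1(gold, pred, field_types=FIELD_TYPES):
    """Micro-averaged F1: pool token-level counts over all field types."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    counts = Counter()
    for g, p in zip(gold, pred):
        if p in field_types and p == g:
            counts["tp"] += 1  # token tagged with the correct field type
        else:
            if p in field_types:
                counts["fp"] += 1  # a field was predicted, but wrongly
            if g in field_types:
                counts["fn"] += 1  # a true field token was missed
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example with one tag per OCR token (tp=2, fp=1, fn=1).
gold = [OUTSIDE, "ID", "QTY", OUTSIDE, "ID"]
pred = [OUTSIDE, "ID", "QTY", "ID", OUTSIDE]
score = micro_f1(gold, pred)
```

Pooling the counts before computing the score means that frequent field types weigh more than rare ones, which is the usual distinction between micro and macro averaging.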
