Recurrent Neural Network Approach for Table Field Extraction in Business Documents
2019
Efficiently extracting information from documents
issued by their partners is crucial for companies confronted daily
with a huge document flow. In particular, tables contain the most
valuable information in business documents, but this knowledge
is challenging to retrieve automatically because tables from industrial
contexts may have complex and ambiguous physical structures.
Bypassing table structure recognition, we propose a generic
method for end-to-end table field extraction that starts with the
list of document tokens segmented by an OCR engine and
directly tags each of them with one of the possible field
types. Similarly to the state-of-the-art methods for non-tabular
field extraction, our approach resorts to a token-level recurrent
neural network combining spatial and textual features of tokens.
Because it involves minimal domain-specific knowledge, the method could
be easily adapted to information extraction from other document
types. We empirically demonstrate the effectiveness of recurrent
neural networks for our task by comparing our method with
a baseline feedforward neural network that receives only limited context
information as input. We train and evaluate the two approaches
on a dataset of 28,570 real-world purchase orders for which we
wish to retrieve the ID numbers and quantities of the ordered
products. On this dataset, our method outperforms the baseline,
with respective micro F1 scores of 0.78 and 0.75 for extracting
table fields from document layouts not seen during model training.
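
As an illustration of the metric reported above, the following is a minimal sketch of a micro-averaged F1 score over token-level field tags. The abstract does not specify the exact evaluation protocol, so the details here are assumptions: the label names `ID`, `QTY` and `O` (for tokens that belong to no field) are hypothetical, and counts are pooled over all field types before computing precision and recall.

```python
def micro_f1(gold, pred, ignore_label="O"):
    """Micro-averaged F1 over token tags, pooling counts across field types.

    Assumption: tokens tagged with `ignore_label` belong to no field and
    are not counted as predictions or as targets.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore_label:
            if p == g:
                tp += 1  # field token tagged with the right field type
            else:
                fp += 1  # predicted field is wrong or spurious
        if g != ignore_label and p != g:
            fn += 1      # gold field token missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical tags for five OCR tokens of one purchase order.
gold = ["ID", "QTY", "O", "ID", "O"]
pred = ["ID", "O", "O", "QTY", "ID"]
print(round(micro_f1(gold, pred), 3))
```

Pooling counts before averaging means that frequent field types weigh more heavily in the score than rare ones, which is the usual distinction between micro and macro averaging.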