Multilayer Identification: Combining N-Grams, TF-IDF and Monge-Elkan in Massive Real Time Processing

2019 
In modern societies, control is based on information. In many countries, companies are now obliged to submit all their invoices to the tax administration, and withholders and financial entities must provide the information used to offer prefilled tax returns. In Spain, the Tax Agency (AEAT) receives 180 million invoices per month and, within a few days at the end of January, must process more than 500 million records to prefill income tax forms. Hundreds of thousands of these records are not correctly identified by the provider and must be returned to the sender or stored as unidentified and analyzed afterwards. Traditionally, this process consumed considerable technical and human resources. AEAT has, for the first time, been able to provide a real-time identification solution with a throughput high enough to meet its needs. The solution combines six algorithms built on three different ideas (n-grams, TF-IDF, and Monge-Elkan) and has surpassed all previous expectations.
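To give a feel for how the three ideas named above can work together, the following is a minimal Python sketch, not the six algorithms used by AEAT, which the abstract does not describe. It assumes names are matched token by token: character n-grams score individual tokens so that small typos are tolerated, IDF weights make rare tokens count more than common ones, and Monge-Elkan averages each query token's best match. All function names, the three-name registry, and the choice of trigrams with a Dice coefficient are illustrative assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lower-cased string."""
    pad = " " * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over character n-gram sets (an assumed inner measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def idf_weights(names):
    """Smoothed inverse document frequency per token, one 'document' per name."""
    doc_freq = Counter(tok for name in names for tok in set(name.lower().split()))
    total = len(names)
    return {tok: math.log((1 + total) / (1 + df)) + 1 for tok, df in doc_freq.items()}


def monge_elkan(a, b, inner=ngram_similarity, weights=None):
    """Weighted Monge-Elkan: average, over tokens of `a`, of the best match in `b`.

    Tokens missing from `weights` default to weight 1.0 (an assumption).
    """
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    weights = weights or {}
    total = sum(weights.get(t, 1.0) for t in tokens_a)
    score = sum(weights.get(t, 1.0) * max(inner(t, u) for u in tokens_b)
                for t in tokens_a)
    return score / total


# Hypothetical registry of known taxpayers and a misspelled incoming name.
registry = ["Maria Garcia Lopez", "Mario Garcia Perez", "Juan Lopez Martin"]
weights = idf_weights(registry)
query = "Maria Garzia Lopez"
best = max(registry, key=lambda name: monge_elkan(query, name, weights=weights))
print(best)
```

Note that Monge-Elkan is asymmetric: scoring the query against a candidate can differ from scoring the candidate against the query, so a production system would have to choose a direction or combine both.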