The challenges of German archival document categorization on insufficient labeled data

Fabian Hoppe, Tabea Tietz, Danilo Dessì, Nils Meyer, Mirjam Sprau, Mehwish Alam, Harald Sack

2020

Document exploration in archives is often challenging because records are not organized into topic-based categories. Moreover, archival records provide only short texts, which are often insufficient for capturing their semantics. This paper proposes and explores a dataless categorization approach that uses word embeddings and TF-IDF to categorize archival documents. Additionally, it introduces a visual approach built on top of the word embeddings to enhance data exploration. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.
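
As a rough illustration of what dataless categorization with word embeddings and TF-IDF can look like, the sketch below assigns each document to the category whose label embedding is closest to a TF-IDF-weighted average of the document's word vectors. This is not the authors' pipeline: the abstract does not describe one. The toy three-dimensional embeddings, the category labels `finanzen` and `militär`, and all function names are hypothetical and chosen only for the example; a real setup would load pretrained German word vectors.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical toy embeddings (assumption); real work would use pretrained German vectors.
EMBEDDINGS = {
    "steuer": np.array([0.9, 0.1, 0.0]),
    "finanzen": np.array([1.0, 0.0, 0.0]),
    "haushalt": np.array([0.8, 0.2, 0.1]),
    "armee": np.array([0.0, 0.9, 0.1]),
    "militär": np.array([0.0, 1.0, 0.0]),
    "soldat": np.array([0.1, 0.8, 0.0]),
    "akte": np.array([0.3, 0.3, 0.3]),
}
DIM = 3


def idf_weights(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in df}


def doc_vector(tokens, idf):
    """TF-IDF-weighted average of the embeddings of known tokens."""
    tf = Counter(t for t in tokens if t in EMBEDDINGS)
    if not tf:
        return np.zeros(DIM)
    weights = np.array([tf[t] * idf.get(t, 1.0) for t in tf])
    vectors = np.stack([EMBEDDINGS[t] for t in tf])
    return weights @ vectors / weights.sum()


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def categorize(tokens, labels, idf):
    """Pick the label whose embedding is most similar to the document vector."""
    vec = doc_vector(tokens, idf)
    return max(labels, key=lambda label: cosine(vec, EMBEDDINGS[label]))


# Two tiny tokenized "records"; no labeled training data is involved.
docs = [["akte", "steuer", "haushalt"], ["akte", "armee", "soldat"]]
idf = idf_weights(docs)
predictions = [categorize(d, ["finanzen", "militär"], idf) for d in docs]
print(predictions)
```

The only supervision here is the category names themselves, which is why the quality of the embeddings matters so much: if the vectors do not place a short record near the right label, nothing else in the procedure can correct it.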

Keywords:

Categorization
current vector
Information retrieval
Computer science
archival document
German
Visual approach
Labeled data
