R-grams: Unsupervised Learning of Semantic Units in Natural Language.

Ariel Ekgren, Amaru Cuba Gyllensten, Magnus Sahlgren

2018

This paper introduces a novel type of data-driven segmented unit that we call r-grams. We illustrate one algorithm for calculating r-grams, and discuss its properties and its impact on the frequency distribution of text representations. We evaluate the proposed approach by applying it in embedding techniques, in both monolingual and multilingual test settings. We also provide a number of qualitative examples that demonstrate its viability as a language-invariant segmentation procedure.

Keywords:

Artificial intelligence
Machine learning
Natural language
Unsupervised learning
Embedding
Computer science
Segmentation
Machine translation
Token frequency
Byte
Natural language processing
