Context-Based Clustering of Assamese Words using N-gram Model

M. P. Bhuyan, S. K. Sarma, P. Sarma

2021

The popularity of mobile devices and the availability of the Internet have increased the use of online platforms for chatting and communication. With this, the use of local languages is also growing, because people are most comfortable writing in their mother tongue. In this research work, Assamese words are clustered using N-gram models. The clustering is context-based and preserves the contextual information of the words. The resulting word clusters can support the development of Assamese language tools such as part-of-speech tagging, spell checking, and word prediction. Word clusters are built using bigram, trigram, and quadrigram models: words that appear in the same context are stored in the same cluster list. A corpus of almost 600K words is used to build the clusters, and a similarity score is calculated between two words to decide whether they belong in the same cluster. The accuracy of the clustering system is around 60%, and 733 different clusters are extracted, with a maximum of 15 words in a cluster.
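
The general idea of context-based clustering with N-grams can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity score or the grouping procedure, so the Jaccard similarity over left contexts, the greedy seed-based grouping, the `threshold` value, and all function names below are assumptions made for illustration. Only the N-gram context idea and the cap of 15 words per cluster come from the abstract. English placeholder tokens stand in for Assamese words.

```python
from collections import defaultdict


def context_sets(tokens, n):
    """Map each word to the set of (n-1)-word left contexts it follows."""
    contexts = defaultdict(set)
    for i in range(n - 1, len(tokens)):
        contexts[tokens[i]].add(tuple(tokens[i - n + 1:i]))
    return contexts


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for the paper's score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_words(tokens, n=2, threshold=0.5, max_size=15):
    """Greedy grouping: a word joins the first cluster whose seed word
    (the cluster's first element) has sufficiently similar contexts."""
    contexts = context_sets(tokens, n)
    clusters = []
    for word in sorted(contexts):
        for cluster in clusters:
            seed = cluster[0]
            if (len(cluster) < max_size
                    and jaccard(contexts[word], contexts[seed]) >= threshold):
                cluster.append(word)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([word])
    return clusters


# Placeholder corpus; a real run would use a tokenized Assamese corpus.
tokens = "i eat rice i eat fish you eat rice you eat fish".split()
print(cluster_words(tokens, n=2))
# → [['eat'], ['fish', 'rice'], ['i', 'you']]
```

Setting `n=3` or `n=4` switches the context to the two or three preceding words, corresponding to the trigram and quadrigram models mentioned above. Longer contexts are more specific, so fewer words share them and clusters tend to be smaller.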
