Clustering Massive-Categories and Complex Documents via Graph Convolutional Network

2021 
In recent years, a significant amount of text data has been generated on the Internet and in digital applications. Clustering unlabeled documents has therefore become an essential task in areas such as automated document management and information retrieval. A typical document clustering approach consists of two major steps: step one extracts suitable features to model the documents, and step two applies a clustering method to categorize them. Recent document clustering research mostly focuses on step one, finding a high-quality embedding or vector representation and then adopting traditional clustering methods for the second step, or inferring the document representation based on a predetermined number of clusters k. However, traditional clustering methods are designed with simplistic assumptions about the data distribution, so they fail to cope with documents in complex distributions and handle only a small number of clusters (i.e., fewer than 50); in addition, these previous methods require a predetermined k. In this paper, we introduce the Graph Convolutional Network (GCN) into document clustering, in place of the traditional clustering methods, and propose a supervised GCN-based document clustering algorithm, DC-GCN, which is able to handle documents in noisy, huge, and complex distributions via a learnable similarity estimator. Our proposed algorithm first adopts a GCN-based confidence estimator to learn each document's position within a cluster via an affinity graph, and then adopts a GCN-based similarity estimator to learn document similarity by constructing doc-word graphs that integrate a document's local neighbor documents and its keywords. Based on the confidence and similarity, the document clusters are finally formed. Our experimental evaluations show that DC-GCN achieves 21.88%, 17.35%, and 15.58% performance improvements in \(F_p\) over the best baseline algorithms on three different datasets.
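The abstract describes a three-part pipeline: an affinity graph over documents, a GCN-based confidence score per document, and a similarity-guided cluster formation step. The following is a minimal, self-contained sketch of that pipeline shape, under stated assumptions: all function names (`knn_affinity`, `gcn_layer`, `confidence_scores`, `form_clusters`), the untrained random GCN weights, and the "link each document to its most similar higher-confidence neighbor" rule are illustrative choices, not the paper's actual architecture, losses, or doc-word graph construction.

```python
# Illustrative sketch of a confidence-then-similarity GCN clustering pipeline.
# NOT the DC-GCN implementation: weights are random/untrained and the linking
# rule is a common heuristic chosen here for illustration.
import numpy as np

def knn_affinity(X, k=3):
    """Build a symmetric kNN affinity graph from document embeddings."""
    norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = norm @ norm.T                      # cosine similarity matrix
    A = np.zeros_like(sim)
    for i in range(len(X)):
        nbrs = np.argsort(-sim[i])[1:k + 1]  # top-k neighbors, skipping self
        A[i, nbrs] = sim[i, nbrs]
    return np.maximum(A, A.T), sim           # symmetrize

def gcn_layer(A, H, W):
    """One graph-convolution layer: D^{-1/2}(A+I)D^{-1/2} H W, then ReLU."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

def confidence_scores(A, H, W):
    """Stage 1 (sketch): score how 'central' each document is in its cluster."""
    return gcn_layer(A, H, W).sum(axis=1)    # one scalar per document

def form_clusters(A, conf, sim):
    """Stage 2 (sketch): each document joins its most similar graph neighbor
    with strictly higher confidence; documents with none become cluster roots.
    Strictly increasing confidence along parent chains rules out cycles."""
    n = len(conf)
    parent = np.arange(n)
    for i in range(n):
        higher = [j for j in np.nonzero(A[i])[0] if conf[j] > conf[i]]
        if higher:
            parent[i] = max(higher, key=lambda j: sim[i, j])
    labels = np.empty(n, dtype=int)
    for i in range(n):                       # follow chains up to the root
        r = i
        while parent[r] != r:
            r = parent[r]
        labels[i] = r
    return labels

# Toy demo: two well-separated groups of 4-d "document embeddings".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 0.1, (5, 4)), rng.normal(5, 0.1, (5, 4))])
A, sim = knn_affinity(X, k=2)
conf = confidence_scores(A, X, rng.normal(size=(4, 4)))
labels = form_clusters(A, conf, sim)
```

Because the kNN affinity graph keeps edges inside each group, no cluster in this toy run ever spans both groups; the real DC-GCN replaces the random weights with estimators trained on labeled clusters and replaces raw cosine similarity with its learned doc-word-graph similarity.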