Unsupervised Decomposition of Multi-Author Document

Kautsya Kanu, Sayantan Sengupta

2019

This paper proposes improvements to the method of Akiva and Koppel [A generic unsupervised method for decomposing multi-author documents, N. Akiva and M. Koppel, 2013]. We work on two aspects. First, we try to capture an author's writing style with n-gram models of words and POS tags and a PQ-gram model of syntactic parse trees, in place of the basic unigram model used in the baseline. Second, we add layers of refinement to the existing baseline model and introduce a new term, the "similarity index", to distinguish pure segments from mixed segments before unsupervised labeling. The similarity index measures both the overall and the sudden change in writing style between the lexicalised/unlexicalised sentences of a segment, using the PQ-gram model for syntax and the n-gram model for word usage. We investigate the role of feature selection that captures author-specific syntactic patterns, and its overall effect on the final accuracy of the baseline system. More specifically, we insert a refinement layer into the baseline system and define a threshold on the similarity measure among the sentences of a segment to decide whether the segment is pure enough to be given as input to the GMM. The key idea of our approach is to provide the GMM clustering with "good segments" so that clustering precision is maximised; the resulting cluster assignments are then used as labels to train a classifier. We also try different feature sets, such as bigrams and trigrams of POS tags and a PQ-gram-based feature on an unlexicalised PCFG, to capture distinct writing styles; these features are given as input to a GMM trained with the iterative EM algorithm to generate good clusters of the segments of the merged document.
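
The segment-filtering step can be illustrated with a minimal sketch. The abstract does not give the formula for the similarity index, so the definition below is an assumption made purely for illustration: the mean pairwise Jaccard overlap of word-bigram sets across the sentences of a segment. It stands in for the PQ-gram and n-gram measures used in the paper, and the function names and the threshold value are hypothetical.

```python
from itertools import combinations


def word_ngrams(sentence, n=2):
    """Set of word n-grams in a sentence (lowercased, whitespace-tokenised)."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; taken to be 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_index(segment, n=2):
    """ASSUMED definition, not the paper's: mean pairwise Jaccard overlap
    of the n-gram sets of all sentence pairs in a segment."""
    grams = [word_ngrams(sentence, n) for sentence in segment]
    pairs = list(combinations(grams, 2))
    if not pairs:  # a single-sentence segment has no pairs to compare
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def select_pure_segments(segments, threshold):
    """Keep the segments whose similarity index reaches the threshold.
    The threshold is a hypothetical tuning parameter."""
    return [seg for seg in segments if similarity_index(seg) >= threshold]


# Toy segments: each segment is a list of sentences.
pure = ["the cat sat on the mat", "the cat sat on the rug"]
mixed = ["the cat sat on the mat", "stock prices fell sharply today"]
kept = select_pure_segments([pure, mixed], threshold=0.5)  # keeps only `pure`
```

In the pipeline the abstract describes, only the segments kept by such a filter would be passed to the GMM for clustering, and the resulting cluster assignments would serve as training labels for the classifier.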
