An Efficient Method for Document Categorization Based on Word2vec and Latent Semantic Analysis

2015 
Document categorization is the task of automatically assigning the documents in a mixed collection to categories, and the main difficulty is representing document content completely in a vector space. This paper proposes a new model, Latent Semantic Analysis (LSA) + word2vec, for categorizing documents. This is the first attempt to combine word2vec with LSA for document categorization, and it maps documents to a vector space while fully preserving their content. First, we build a term-by-document matrix whose elements are determined by Term Frequency-Inverse Document Frequency (TF-IDF) weighting and by word vectors trained with word2vec. The resulting matrix is 3-dimensional and describes both the meaning of every word and the content of every document. Second, Singular Value Decomposition (SVD) is applied to the matrix, which lowers the computational complexity. The document vectors produced by the new model are then used to train a Convolutional Neural Network (CNN), an efficient deep learning algorithm that greatly improves classification accuracy. We evaluate performance on the 20newsgroups corpus. The results show that the new model performs better on document categorization tasks, improving accuracy by about 15% over traditional methods such as LSA and the Vector Space Model (VSM).
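The first two steps described above (a TF-IDF-weighted term-by-document structure combined with word vectors, followed by SVD) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the 3-dimensional matrix is decomposed, so the sketch assumes it is unfolded into a 2-D matrix with one column per document. The function names, the `+ 1.0` IDF smoothing, and the random vectors standing in for trained word2vec embeddings are all hypothetical. The CNN stage is omitted.

```python
import numpy as np

def tfidf_matrix(docs):
    """Return (vocab, W), where W[i, j] is the TF-IDF weight of term i in document j."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            tf[index[w], j] += 1
    df = (tf > 0).sum(axis=1)               # number of documents containing each term
    idf = np.log(len(docs) / df) + 1.0      # assumed smoothing variant
    return vocab, tf * idf[:, None]

def lsa_word2vec_doc_vectors(docs, word_vectors, k):
    """Hypothetical helper: k-dimensional document vectors from TF-IDF, word vectors, and SVD."""
    vocab, W = tfidf_matrix(docs)
    E = np.stack([word_vectors[w] for w in vocab])    # (terms, dim)
    # 3-D array: T[i, j, :] = tfidf[i, j] * vector of term i
    T = W[:, :, None] * E[:, None, :]                 # (terms, docs, dim)
    # Assumption: unfold to 2-D so that each document is one column
    A = T.transpose(0, 2, 1).reshape(-1, len(docs))   # (terms * dim, docs)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular directions, as in standard LSA
    return (s[:k, None] * Vt[:k]).T                   # (docs, k)

# Toy usage; random vectors stand in for embeddings trained with word2vec
docs = [["space", "nasa", "orbit"],
        ["hockey", "game", "team"],
        ["nasa", "launch", "orbit", "space"]]
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=5) for d in docs for w in d}
doc_vecs = lsa_word2vec_doc_vectors(docs, vectors, k=2)
print(doc_vecs.shape)  # (3, 2)
```

In a real pipeline the rows of `doc_vecs` would serve as the classifier input, and `k` would be tuned on held-out data.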