Unsupervised feature selection for text classification via word embedding

2016 
The key to analyzing large collections of text documents is classifying them. To classify text documents, they must be represented as vectors, i.e., in a vector space model (VSM). A powerful vector space model should preserve the classification information with as few dimensions as possible; to achieve this, it is important to select the most effective features for text classification. Unlike supervised selection methods, which exploit the category information in the training data, we propose an unsupervised feature selection method. Our method requires no category information, which opens up more application scenarios, since labeled data is expensive and often inaccurate. Unlike other unsupervised methods, our method uses word embeddings to find words with similar semantic meanings. The word embedding maps words into vectors while preserving the semantic relationships between them. Because it is redundant to include all semantically similar words as features, we select the most representative word on behalf of each group of similar words. We demonstrate on the Reuters-21578 dataset that our method outperforms other methods, with a particularly large advantage when only a limited number of features is selected.
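The abstract does not spell out the selection procedure, but the idea of grouping semantically similar word vectors and keeping one representative per group can be sketched as follows. This is only an illustrative interpretation, not the paper's exact algorithm: the toy vocabulary and random 4-dimensional "embeddings" are hypothetical stand-ins for vectors from a trained word-embedding model, and k-means with centroid-nearest selection is one plausible way to pick representatives.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical tiny vocabulary with made-up 4-d "embeddings";
# in practice these would come from a trained embedding model.
rng = np.random.default_rng(0)
vocab = ["car", "auto", "vehicle", "bank", "money", "finance"]
embeddings = rng.normal(size=(len(vocab), 4))

def select_representatives(words, vectors, n_features):
    """Cluster word vectors and keep, per cluster, the word whose
    vector is closest to the cluster centroid as the representative
    feature; the rest of the cluster is treated as redundant."""
    km = KMeans(n_clusters=n_features, n_init=10, random_state=0).fit(vectors)
    reps = []
    for c in range(n_features):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(vectors[idx] - km.cluster_centers_[c], axis=1)
        reps.append(words[idx[np.argmin(dists)]])
    return reps

features = select_representatives(vocab, embeddings, n_features=3)
print(features)
```

With real embeddings, near-synonyms such as "car"/"auto"/"vehicle" would land in one cluster and contribute a single feature, reducing the VSM's dimensionality without discarding the semantic signal.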