A general framework to expand short text for topic modeling

2017 
Highlights:
    • A framework to generate pseudo-documents suitable for topic modeling is proposed.
    • An instantiation of the framework based on distributed word vector representations is presented.
    • The NPMI and F1 results obtained are better than those of state-of-the-art methods.

Short texts are everywhere on the Web, including messages posted on social media, status updates, and blog comments, and uncovering the topics of these messages is crucial to a wide range of applications, e.g., content analysis and user characterization. Extracting topics from short texts is challenging because conventional methods, such as Latent Dirichlet Allocation, depend on word co-occurrence, which is rare in short texts and makes these methods suffer from severe data sparsity. This paper proposes a general framework for topic modeling of short texts that creates larger pseudo-document representations from the original documents. In the framework, document components (e.g., words or bigrams) are defined over a metric space, which provides information about the similarity between them. We present two simple, effective, and efficient methods that specialize the general framework to create larger pseudo-documents: the first uses word co-occurrence to define the metric space, while the second relies on distributed word vector representations. The generated pseudo-documents can be given as input to any topic modeling algorithm. Experiments on seven datasets, compared against state-of-the-art methods that extract topics either by generating pseudo-documents or by modifying existing topic models for short texts, show that our methods significantly improve results in terms of normalized pointwise mutual information (NPMI). A classification task was also used to evaluate the quality of the topics in terms of document representation, where improvements in F1 ranged from 1.5% to 15%.
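The core idea, merging similar short documents into larger pseudo-documents before running any topic model, can be illustrated with a minimal sketch. The code below is a hypothetical illustration, not the authors' exact algorithm: it assumes toy 2-dimensional word vectors (in practice, pre-trained embeddings would be used), represents each document by the average of its word vectors, and greedily attaches each document to the most similar pseudo-document when cosine similarity exceeds a threshold.

```python
import math

# Toy word vectors (hypothetical; a real instantiation would use
# pre-trained distributed word representations).
WORD_VECS = {
    "game": [0.9, 0.1], "team": [0.8, 0.2], "score": [0.85, 0.15],
    "stock": [0.1, 0.9], "market": [0.2, 0.8], "price": [0.15, 0.85],
}

def doc_vector(doc):
    """Represent a document as the mean of its known word vectors."""
    vecs = [WORD_VECS[w] for w in doc.split() if w in WORD_VECS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def build_pseudo_documents(docs, threshold=0.95):
    """Greedily attach each short document to the most similar
    pseudo-document; start a new one when similarity is below threshold."""
    pseudo = []  # list of (centroid, [member docs])
    for doc in docs:
        v = doc_vector(doc)
        if v is None:  # no known words: keep the document on its own
            pseudo.append((None, [doc]))
            continue
        best, best_sim = None, -1.0
        for idx, (c, _members) in enumerate(pseudo):
            if c is None:
                continue
            s = cosine(v, c)
            if s > best_sim:
                best, best_sim = idx, s
        if best is not None and best_sim >= threshold:
            c, members = pseudo[best]
            members.append(doc)
            n = len(members)  # incremental centroid update
            pseudo[best] = ([(c[i] * (n - 1) + v[i]) / n for i in range(len(c))], members)
        else:
            pseudo.append((v, [doc]))
    # Each pseudo-document is the concatenation of its members.
    return [" ".join(members) for _, members in pseudo]

docs = ["game score", "team game", "stock price", "market price"]
print(build_pseudo_documents(docs))
# → ['game score team game', 'stock price market price']
```

The resulting pseudo-documents are plain text and can therefore be fed directly to any off-the-shelf topic model such as LDA, which is the key property the framework exploits.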