NetClass: A network-based relational model for document classification

2018 
Abstract Aiming to handle the complexity inherent in human textual communication, Automatic Document Classification (ADC) methods often adopt several simplifications. One such simplification is to treat the terms that compose a document as independent, which may hide important relationships between them. These relationships can encapsulate non-trivial patterns that improve classification effectiveness. In this work, we propose NetClass, a new network-based model for documents that explicitly considers term relationships, and introduce a family of relational algorithms for ADC, such as the LRN-WRN classifier, a lazy relational ADC algorithm that exploits not only relationships between terms but also neighborhood information. As our extensive experimental evaluation shows, the proposed LRN-WRN achieves competitive performance when compared to the state of the art in ADC, including SVM, across seven distinct domains. More specifically, LRN-WRN outperforms state-of-the-art classifiers in 5 out of 7 domains and is among the top two best-performing classifiers in all assessed domains. Our evaluation highlights the high effectiveness of our proposal, as well as its efficiency in terms of runtime. Indeed, besides effectiveness and efficiency, the simplicity of our proposal and the absence of complex parameter tuning are key characteristics that make our algorithms interesting alternatives for ADC. In particular, as highlighted by our experimental evaluation, LRN-WRN is a promising alternative for dynamic domains with a huge volume of short texts (e.g., social media content) or with several classes.