Characterising text mining: a systematic mapping review of the Portuguese language.

2017 
Documents written in natural language constitute a major part of the artefacts produced during the software engineering life cycle. Studies indicate that more than 80% of enterprise data is stored in some sort of unstructured form, mainly as text. Moreover, the growth of user-generated content, especially from social media, provides a huge amount of data from which the experiences, opinions, and feelings of users can be discovered. Text mining refers to the set of tools, techniques, and algorithms adopted to extract useful information from unstructured data. Considering that Portuguese ranks among the ten most spoken languages and is the second most common language on Twitter, this study aims to map current primary studies that relate to the application of text mining for Portuguese. A systematic mapping method was applied, and 6075 primary studies published up to 2014 were retrieved. A total of 203 studies were included, of which more than 60% analyse texts written in the Brazilian variant. The majority of studies focus on the text classification task. Support vector machine and Naive Bayes appear as the main algorithms. The Folha de Sao Paulo and Publico newspapers appear as the main corpora, followed by the Portuguese Attorney General's Office corpus and Twitter.