An Empirical Study of TextRank for Keyword Extraction

2020 
TextRank is a widely used keyword extraction technique with commercial applications in text classification, information retrieval and clustering. In these applications, its parameters, including the co-occurrence window size, iteration number and decay factor, are usually set only roughly, which may reduce the quality of the returned keywords. In this work, we conduct an empirical study of TextRank aimed at finding optimal parameter settings for keyword extraction. The experiments use two real-world datasets, Hulth2003 and Krapivin2009. We first remove stop words using XPO6, a publicly available English stop word list, and then extract word stems with the Porter Stemmer, a tool that reduces the multiple variants of a word to a common stem, discarding redundant information, strengthening the filtering effect, and retaining the effective features of the text. We carry out extensive experiments to evaluate the effect of each parameter on keyword extraction, and measure the quality of the corresponding results by Precision, Recall and Accuracy. The experimental results show that TextRank performs best with co-occurrence window size $w = 3$, iteration number $t = 20$, decay factor $c = 0.9$ and rank $k = 10$, and that these results are independent of text length.
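To make the roles of the four parameters concrete, the following is a minimal Python sketch of a TextRank-style keyword ranker, not the implementation used in the study. The function name `textrank_keywords`, the tiny `STOP_WORDS` set (a stand-in for the XPO6 list) and the exact window definition are assumptions made for illustration, and Porter stemming is omitted to keep the example dependency-free.

```python
from collections import defaultdict

# Tiny illustrative stop word set (assumption); the study uses the XPO6 list.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "is", "for", "to", "on", "by", "with"}


def textrank_keywords(tokens, w=3, t=20, c=0.9, k=10):
    """Return up to k top-ranked words from a list of tokens.

    w: co-occurrence window size, t: iteration number,
    c: decay factor, k: number of keywords to return.
    """
    words = [tok.lower() for tok in tokens if tok.lower() not in STOP_WORDS]

    # Build an undirected co-occurrence graph: two distinct words are linked
    # when they fall inside the same window of w consecutive words.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + w]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Run t rounds of the PageRank-style score update with decay factor c.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(t):
        scores = {
            word: (1 - c)
            + c * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:k]


if __name__ == "__main__":
    text = "the ranking of a graph model for keyword extraction with a graph"
    print(textrank_keywords(text.split(), w=3, t=20, c=0.9, k=5))
```

The defaults mirror the settings the abstract reports as best ($w = 3$, $t = 20$, $c = 0.9$, $k = 10$); sweeping each argument while holding the others fixed is one way to set up the kind of parameter study described above.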