A Comparative Analysis of Latent Variable Models for Web Page Classification

István Bíró, András A. Benczúr, Jácint Szabó, Ana Gabriela Maguitman

2008

A main challenge for Web content classification is how to model the input data. This paper applies two text modeling approaches, latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), to the Web page classification task. We compare the two approaches using different vocabularies consisting of links and text, and we evaluate both models with different numbers of latent topics. Finally, we evaluate a hybrid latent variable model that combines the latent topics resulting from both LSA and LDA. This new approach turns out to be superior to the basic LSA and LDA models. In our experiments with categories and pages obtained from the ODP Web directory, the hybrid model achieves an averaged F-measure of 0.852 and an averaged ROC value of 0.96.
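
The hybrid idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "combining" the latent topics means concatenating the per-document LSA and LDA topic vectors, it uses scikit-learn's TruncatedSVD and LatentDirichletAllocation as stand-ins for the two models, and it picks logistic regression as the classifier. The toy corpus, the labels, and the helper name `hybrid_features` are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for Web pages; the two labels are made-up categories.
docs = [
    "football match goal league team",
    "team wins league football cup",
    "goal scored in cup match",
    "python code compiler software bug",
    "software bug fixed in compiler",
    "code review for python software",
]
labels = [0, 0, 0, 1, 1, 1]


def hybrid_features(texts, n_lsa=2, n_lda=2, seed=0):
    """Concatenate LSA and LDA topic vectors (an assumed way to combine them)."""
    counts = CountVectorizer().fit_transform(texts)
    tfidf = TfidfTransformer().fit_transform(counts)
    # LSA: truncated SVD of the TF-IDF weighted term-document matrix.
    lsa = TruncatedSVD(n_components=n_lsa, random_state=seed).fit_transform(tfidf)
    # LDA: per-document topic proportions estimated from raw term counts.
    lda = LatentDirichletAllocation(
        n_components=n_lda, random_state=seed
    ).fit_transform(counts)
    return np.hstack([lsa, lda])


features = hybrid_features(docs)
# Any standard classifier could be trained here; logistic regression is an assumption.
clf = LogisticRegression().fit(features, labels)
```

Each row of `features` holds the LSA coordinates followed by the LDA topic proportions for one document, so the numbers of LSA and LDA topics can be varied independently through `n_lsa` and `n_lda`.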

Keywords:

Web directory
Computer science
Latent semantic analysis
Data mining
Web page
Latent Dirichlet allocation
Latent variable model
Probabilistic latent semantic analysis
Latent variable
Artificial intelligence
Text mining
Pattern recognition
Web content
