Learning with Taxonomies: Classifying Documents and Words

Thomas Hofmann, Lijuan Cai

2003

Abstract

Automatically extracting semantic information about word meaning and document topic from text typically involves an extensive number of classes. Such classes may represent predefined word senses, topics or document categories and are often organized in a taxonomy. The latter encodes important information, which should be exploited in learning classifiers from labeled training data. To that extent, this paper presents an extension of multiclass Support Vector Machine learning which can incorporate prior knowledge about class relationships. The latter can be encoded in the form of class attributes, similarities between classes or even a kernel function defined over the set of classes. The paper also discusses how to specify and optimize meaningful loss functions based on the relative position of classes in the taxonomy. We include experimental results for text categorization and for word sense classification.

1 Introduction

Many real-world classification tasks are multiclass problems involving large numbers of classes. This is in particular true for application domains like information retrieval and natural language processing, where classes may correspond to document categories or word senses: several thousand or even tens of thousands of classes are not uncommon. For instance, the International Patent Classification (IPC) scheme [8] consists of approximately 69,000 classes (called groups) that are used to categorize patent documents and WordNet 2.0 [3] consists of almost 80,000 word senses (called synsets) defined by lexicographers to classify the meaning of English nouns. Multiclass problems of this scale pose a severe challenge for learning algorithms and classification accuracies obtained by even the best classification methods are often disappointingly poor.
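To make the abstract's notions of a taxonomy-based loss and a kernel over classes concrete, the following is a minimal sketch and not the formulation used in the paper. It assumes the taxonomy is a tree, takes the loss to be the number of edges on the path between the true and the predicted class, and takes class attributes to be ancestor indicators, so that the class kernel counts shared ancestors. The toy taxonomy and the names `PARENT`, `path_to_root`, `tree_loss` and `class_kernel` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent); the root has parent None.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "science": "root",
    "sports": "root",
    "physics": "science",
    "biology": "science",
    "tennis": "sports",
}


def path_to_root(node: str) -> List[str]:
    """Return the nodes from `node` up to the root, both inclusive."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def tree_loss(true_class: str, predicted_class: str) -> int:
    """Assumed loss: number of edges on the tree path between two classes."""
    up_true = path_to_root(true_class)
    up_pred = path_to_root(predicted_class)
    common = set(up_true) & set(up_pred)
    # Steps from each class up to the lowest common ancestor.
    d_true = next(i for i, n in enumerate(up_true) if n in common)
    d_pred = next(i for i, n in enumerate(up_pred) if n in common)
    return d_true + d_pred


def class_kernel(a: str, b: str) -> int:
    """Assumed class kernel: inner product of ancestor-indicator attribute
    vectors, which equals the number of shared nodes on the root paths."""
    return len(set(path_to_root(a)) & set(path_to_root(b)))
```

Under these assumptions, confusing two sibling classes such as `physics` and `biology` costs less than confusing `physics` with `tennis`, and sibling classes receive a larger kernel value than classes that share only the root.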

