A Review of Active Learning and Co-Training in Text Classification

2005 
Ever-increasing volumes of data in electronic format are being produced and stored, due mainly to the relatively low cost of storage and to emerging techniques for extracting useful information from such data. A large corpus of data does not, in itself, offer much value, but if it were categorised into relevant, prescribed categories, it could become quite useful. Text classification is the name given to automated techniques for grouping textual information into categories. The task of categorisation lends itself fully to automation. Large corpora require administration and a technique to manage the data effectively, since the rate at which data is produced is far greater than the rate at which it can be manually processed. Manual categorisation of data is feasible only on a small scale, yet even a small organisation can now produce large amounts of data. Modern machine learning techniques (supervised learning) allow a classifier to be produced using only labelled examples of the concept to be learned. Previously, considerable human effort was required to manually construct and maintain the hand-crafted rules which formed the classifier. Modern machine learning techniques have made available off-the-shelf learners which, given enough training data, can construct an accurate classifier. Large quantities of training data are required for an accurate classifier to be produced. However, obtaining labelled training examples can often be an expensive task in itself. Typically, examples need to be manually labelled, which is a laborious, time-consuming and repetitive task. Most humans do not relish the idea of labelling thousands of examples. In many domains labelled training data is either scarce or expensive to produce. Unlabelled data, on the other hand, is plentiful and inexpensive to collect.
The learner should make use of this unlabelled data to help it produce an accurate classifier without the need for large amounts of labelled training data. Active learning and co-training are methods which require very little labelled training data and exploit unlabelled data in order to increase accuracy.