BurstPU: Classification of Weakly Labeled Datasets with Sequential Bias

2020 
In big data applications from digital health to assisted living smart systems, only a fraction of the data instances used for training classifiers tend to be labeled. One important subfield of weakly labeled learning, called Positive Unlabeled (PU) learning, does not require a completely labeled dataset in order to train a strong classifier. This is crucial, as in many domains it is expensive or impossible to obtain a completely labeled dataset. While prior PU work assumed that unlabeled instances occurred with a random uniform distribution, we observe that labeled (and unlabeled) data tends to occur in long contiguous sequences (or bursts) due to the prevalent burst labeling behavior of human annotators. Burst labeling leads to a sequential bias in PU data not addressed by state-of-the-art methods. To tackle this open problem of learning under sequential bias, we propose BurstPU, the first framework for training a classifier on sequentially labeled PU data. BurstPU addresses the challenge that two interdependent models must be learned, namely, the classification model and the labeling likelihood model, with the latter predicting the likelihood that a given instance is labeled. The labeling likelihood model is then needed during the training of the classification model to account for the bias in the labeling process. Our experimental study demonstrates that BurstPU consistently outperforms state-of-the-art PU methods on a rich variety of diverse real-world datasets, and can learn from fewer labeled instances than state-of-the-art PU methods.
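The core idea of using a labeling likelihood (propensity) model to debias classifier training can be illustrated with a minimal sketch. This is not the authors' BurstPU implementation: the constant propensity, the logistic model, and all function names below are illustrative assumptions, showing only the generic propensity-weighted PU risk (each labeled positive is reweighted by 1/e, with a compensating negative term, while unlabeled instances are treated as negatives).

```python
import numpy as np

def sigmoid(z):
    # Numerically clipped logistic function.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def train_propensity_weighted(X, s, e, lr=0.1, epochs=500):
    """Logistic regression under a propensity-weighted PU risk (illustrative).

    X: (n, d) features; s: (n,) 1 if instance is labeled positive, else 0;
    e: (n,) assumed labeling propensities (likelihood an instance is labeled).
    Each labeled positive counts as a positive with weight 1/e and as a
    negative with weight 1 - 1/e; unlabeled instances count as negatives.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        # Weights for the "treat as positive" / "treat as negative" terms.
        w_pos = np.where(s == 1, 1.0 / e, 0.0)
        w_neg = np.where(s == 1, 1.0 - 1.0 / e, 1.0)
        # Gradient of the weighted cross-entropy w.r.t. the logit.
        grad = w_pos * (p - 1.0) + w_neg * p
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

# Usage on synthetic data: two separated Gaussian classes, where only
# positives are labeled, each with (assumed) constant propensity 0.5.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)                         # true (hidden) labels
X = rng.normal(0, 1, (n, 2)) + np.where(y[:, None] == 1, 2.0, -2.0)
s = ((y == 1) & (rng.random(n) < 0.5)).astype(int)  # observed PU labels
e = np.full(n, 0.5)
w, b = train_propensity_weighted(X, s, e)
acc = ((sigmoid(X @ w + b) > 0.5).astype(int) == y).mean()
```

The key design point mirrored from the abstract is the interdependence: the propensity estimates `e` enter the classifier's training objective, so a poor labeling likelihood model directly biases the learned classifier.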