Spatiotemporal Representation Learning for Blind Video Quality Assessment

2021 
Blind video quality assessment (BVQA) is of great importance for video-related applications, yet it remains challenging even in the deep learning era. The difficulty lies in the shortage of large-scale labeled data, which makes it hard to train a robust spatiotemporal encoder for BVQA. To relieve this difficulty, we first build a video dataset containing over 320K samples suffering from various compression and transmission artifacts. Since manually annotating the dataset with subjective perception scores is highly labor-intensive and time-consuming, we adopt reference-based VQA algorithms to weakly label the data automatically. We consider that a single weak label is derived from a single source of knowledge, which is deficient and incomplete for VQA. To alleviate the bias of a single weak label (i.e., single knowledge) in the weakly labeled dataset, we propose HEterogeneous Knowledge Ensemble (HEKE) for spatiotemporal representation learning. Compared to learning from single knowledge, learning with HEKE is shown theoretically to achieve a lower infimum and to yield a richer representation. On the basis of the built dataset and the HEKE methodology, a feature encoder specific to BVQA is trained, which directly extracts spatiotemporal representations from videos. The video quality can then be acquired either in a completely blind manner without any ground truth, or via a regressor finetuned with labels. Extensive experiments on various VQA databases show that our BVQA model with the pretrained encoder achieves state-of-the-art performance. More surprisingly, even though it is trained on synthetic data, our model still shows competitive performance on authentic databases. The data and source code will be available at https://github.com/Sissuire/BVQA-HEKE.
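The core HEKE recipe described above (one shared spatiotemporal encoder regressed against several heterogeneous weak labels, each produced offline by a reference-based VQA metric) can be sketched in a few lines. The snippet below is a minimal illustrative PyTorch sketch, not the authors' released implementation: `TinyEncoder`, `HEKEModel`, `heke_loss`, the feature dimension, and the number of knowledge sources are all hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in spatiotemporal backbone (a single 3D conv stage).
    The paper's actual encoder architecture is not reproduced here."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, 3, T, H, W) -> global-average-pooled features (B, feat_dim)
        return self.conv(clips).mean(dim=(2, 3, 4))


class HEKEModel(nn.Module):
    """Shared encoder with one regression head per heterogeneous
    knowledge source, i.e., per reference-based VQA weak labeler."""

    def __init__(self, encoder: nn.Module, feat_dim: int, num_sources: int):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, 1) for _ in range(num_sources)]
        )

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(clips)
        # One prediction per knowledge source: (B, num_sources)
        return torch.cat([head(feats) for head in self.heads], dim=1)


def heke_loss(preds: torch.Tensor, weak_labels: torch.Tensor) -> torch.Tensor:
    """Mean regression error over all sources; weak_labels holds the
    pseudo-scores computed offline by each reference-based VQA metric."""
    return F.l1_loss(preds, weak_labels)


if __name__ == "__main__":
    model = HEKEModel(TinyEncoder(), feat_dim=128, num_sources=3)
    clips = torch.randn(2, 3, 8, 64, 64)   # two 8-frame RGB clips
    weak_labels = torch.rand(2, 3)          # scores from 3 weak labelers
    loss = heke_loss(model(clips), weak_labels)
    loss.backward()
```

In this reading, each of the 320K videos carries one pseudo-score per reference-based metric; at inference, the per-source predictions could be pooled for a fully blind quality estimate, or the pretrained encoder could feed a small regressor finetuned on subjectively labeled data.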