Tracking based on scale-estimated deep networks with hierarchical correlation ensembling for cross-media understanding

2021 
Abstract In many vision-based cross-media applications, the objects of interest inside the visual regions need to be accurately localized and tracked to achieve more effective understanding and generating image descriptions (UGID), for example in audio-visual lip recognition. Unfortunately, robust tracking in realistic scenarios is usually challenged by the dynamic appearance variations that arise while the object is in motion. Recent studies on deep neural networks for classification and recognition tasks have inspired great progress in visual tracking, but the intrinsic assumption of scale invariance during target modeling still limits further improvement in tracking performance. Motivated by learning the object appearance together with a scale estimate, this study proposes a scale-estimated deep network (SEN) to predict the object size more accurately during tracking. By incorporating the proposed SEN into a hierarchical correlation ensembling framework, a joint translation-scale tracking scheme is obtained that estimates the position and scale of the target object simultaneously. Extensive experiments on challenging benchmark datasets demonstrate that the proposed tracker achieves competitive results. Additionally, a performance evaluation on lip tracking shows that the proposed work is also capable of supporting an audio-visual recognition task in different types of cross-media applications.