Pyramid Self-attention for Semantic Segmentation.

2021 
Self-attention is vital in computer vision: it is the building block of the Transformer and can model long-range context for visual recognition. However, computing pairwise self-attention between all pixels in dense prediction tasks (e.g., semantic segmentation) incurs a high computational cost. In this paper, we propose a novel pyramid self-attention (PySA) mechanism that collects global context information far more efficiently. Concretely, the basic module of PySA first divides the whole image into \(R \times R\) regions, and then further divides every region into \(G \times G\) grids. One feature is extracted for each grid, and self-attention is applied to the grid features within the same region. PySA keeps increasing R (e.g., from 1 to 8) to harvest more local context information and propagate global context to local regions in a parallel/series manner. Since G can be kept small, the computational complexity is low. Experimental results show that, compared with the traditional global attention method, PySA greatly reduces the computational cost while achieving comparable or even better performance on popular semantic segmentation benchmarks (e.g., Cityscapes, ADE20K). The project code is released at https://github.com/hustvl/PySA.
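To make the region-and-grid scheme concrete, the following is a minimal NumPy sketch of one PySA basic module as described in the abstract. It is not the authors' implementation (see the linked repository for that): the function name, the use of average pooling to extract one feature per grid, and the plain single-head scaled dot-product attention are all assumptions made for illustration.

```python
import numpy as np

def pysa_module(x, R, G):
    """Hypothetical sketch of one PySA basic module.

    x: feature map of shape (H, W, C); H and W are assumed divisible by R*G.
    The map is split into R x R regions, each region into G x G grids; one
    feature per grid is obtained by average pooling, self-attention runs over
    the G*G grid features of each region, and each refined grid feature is
    broadcast back over its grid cells.
    """
    H, W, C = x.shape
    rh, rw = H // R, W // R      # region height/width
    gh, gw = rh // G, rw // G    # grid height/width inside a region
    out = np.empty_like(x)
    for i in range(R):
        for j in range(R):
            region = x[i*rh:(i+1)*rh, j*rw:(j+1)*rw]                 # (rh, rw, C)
            # One feature per grid via average pooling (an assumption here).
            grids = region.reshape(G, gh, G, gw, C).mean(axis=(1, 3))
            f = grids.reshape(G * G, C)                              # G*G tokens
            # Plain scaled dot-product self-attention over the G*G tokens.
            scores = f @ f.T / np.sqrt(C)
            attn = np.exp(scores - scores.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)
            refined = (attn @ f).reshape(G, G, C)
            # Broadcast each refined grid feature back over its grid cells.
            up = np.repeat(np.repeat(refined, gh, axis=0), gw, axis=1)
            out[i*rh:(i+1)*rh, j*rw:(j+1)*rw] = up
    return out
```

Because attention is computed over only G*G tokens per region, the per-region cost stays O(G^4) regardless of image size, which is why the abstract can increase R (e.g., from 1 to 8) to build the pyramid while keeping G, and hence the attention cost, small.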