64WeVllQjq@OpenReview


#1 Multi-Kernel Correlation-Attention Vision Transformer for Enhanced Contextual Understanding and Multi-Scale Integration

Authors: Hongkang Zhang, Shao-Lun Huang, Ercan Engin KURUOGLU, Yanlong Wang

Significant progress has been achieved with Vision Transformers (ViTs) in computer vision, yet challenges persist in modeling multi-scale spatial relationships, hindering the effective integration of fine-grained local details and long-range global dependencies. To address this limitation, a Multi-Kernel Correlation-Attention Vision Transformer (MK-CAViT) grounded in Hirschfeld-Gebelein-Rényi (HGR) theory is proposed, introducing three key innovations. First, a parallel multi-kernel architecture extracts multi-scale features through small, medium, and large kernels, overcoming the single-scale constraint of conventional ViTs. Second, cross-scale interactions are enhanced through a Fast-HGR attention mechanism, which models nonlinear dependencies and applies adaptive scaling to weigh connections and refine contextual reasoning. Third, a stable multi-scale fusion strategy integrates dynamic normalization and staged learning to mitigate gradient variance, progressively fuse local and global contexts, and improve training stability. Experimental results on ImageNet, COCO, and ADE20K validate the superiority of MK-CAViT in classification, detection, and segmentation, surpassing state-of-the-art baselines in capturing complex spatial relationships while maintaining efficiency. These contributions establish a theoretically grounded framework for visual representation learning and address longstanding limitations of ViTs.
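The parallel multi-kernel idea in the abstract can be illustrated with a minimal NumPy sketch: the same input map is convolved by small, medium, and large kernels in parallel branches whose outputs are stacked into a multi-scale feature volume. The kernel sizes (3, 5, 7) and the random filters are illustrative assumptions, not details taken from the paper, which uses learned weights inside a transformer backbone.

```python
import numpy as np

def conv2d_same(x, kernel):
    # Naive 2-D convolution with 'same' zero padding.
    # x: (H, W); kernel: (k, k) with odd k.
    k = kernel.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    # All k-by-k windows of the padded input, shape (H, W, k, k).
    windows = np.lib.stride_tricks.sliding_window_view(xp, (k, k))
    return np.einsum('hwij,ij->hw', windows, kernel)

def multi_kernel_features(x, kernel_sizes=(3, 5, 7), seed=0):
    # Parallel small/medium/large-kernel branches (sizes are assumptions).
    # Random filters stand in for learned weights in this sketch.
    rng = np.random.default_rng(seed)
    branches = []
    for k in kernel_sizes:
        kernel = rng.standard_normal((k, k)) / k
        branches.append(conv2d_same(x, kernel))
    return np.stack(branches, axis=0)  # (num_scales, H, W)

x = np.arange(64, dtype=float).reshape(8, 8)
feats = multi_kernel_features(x)
print(feats.shape)  # (3, 8, 8)
```

Stacking rather than summing the branch outputs keeps the per-scale responses separate, so a downstream fusion step can weigh local and global context independently, as the abstract's staged fusion strategy suggests.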
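HGR maximal correlation, which the attention mechanism builds on, measures the strongest dependence achievable between transformed versions of two variables; restricted to linear transforms it reduces to the top canonical correlation. The toy sketch below computes that linear surrogate between two feature matrices; it illustrates the correlation score only and is not the paper's Fast-HGR mechanism, whose details the abstract does not specify.

```python
import numpy as np

def linear_hgr_correlation(x, y, eps=1e-6):
    # Top canonical correlation between feature sets x, y of shape (n, d):
    # the HGR maximal correlation restricted to linear functions.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    cxx = x.T @ x / n + eps * np.eye(x.shape[1])  # regularized covariances
    cyy = y.T @ y / n + eps * np.eye(y.shape[1])
    cxy = x.T @ y / n
    wx = np.linalg.inv(np.linalg.cholesky(cxx))  # whitening transform for x
    wy = np.linalg.inv(np.linalg.cholesky(cyy))  # whitening transform for y
    # Singular values of the whitened cross-covariance are the canonical correlations.
    return np.linalg.svd(wx @ cxy @ wy.T, compute_uv=False)[0]

rng = np.random.default_rng(0)
a = rng.standard_normal((500, 4))
b = a @ rng.standard_normal((4, 4)) + 0.1 * rng.standard_normal((500, 4))  # dependent on a
c = rng.standard_normal((500, 4))  # independent of a
rho_dep = linear_hgr_correlation(a, b)
rho_ind = linear_hgr_correlation(a, c)
print(rho_dep, rho_ind)
```

Scores near 1 indicate strong cross-scale dependence and scores near 0 indicate independence, which is the kind of signal an HGR-based attention weight could use to decide how heavily one scale's tokens should attend to another's.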

Subject: NeurIPS.2025 - Poster