Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting

Authors: Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen, Jianghao Li, Fan Min

Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating hierarchical or hybrid features; however, they rely on fixed patch sizes and introduce redundant computation. To address these limitations, we propose the Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. It comprises two key stages: (1) a Coarse Granularity Evaluation module, which assesses visual complexity using edge density, entropy, and frequency-domain cues to estimate suitable patch and window sizes; and (2) a Fine-grained Refinement module, which refines attention computation according to the selected granularity, enabling efficient and precise feature learning. Two learnable parameters, α and β, are optimized end-to-end to balance global reasoning and local perception. Comprehensive evaluations demonstrate that Grc-ViT enhances fine-grained discrimination while achieving a superior trade-off between accuracy and computational efficiency.
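To make the coarse evaluation stage more concrete, the following is a minimal sketch of how edge density, entropy, and a frequency-domain cue could be combined into a complexity score that selects a patch size. It is not the authors' implementation: the function names, the gradient threshold, the frequency cutoff, the equal weights, and the candidate patch sizes (32, 16, 8) are all assumptions chosen for illustration, and the input is assumed to be a grayscale image with values in [0, 1].

```python
import numpy as np

def edge_density(img, thresh=0.1):
    """Fraction of pixels whose gradient magnitude exceeds `thresh` (assumed threshold)."""
    gy, gx = np.gradient(img)
    return float((np.hypot(gx, gy) > thresh).mean())

def normalized_entropy(img, bins=256):
    """Shannon entropy of the intensity histogram, scaled to [0, 1]."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum() / np.log2(bins))

def high_freq_ratio(img, cutoff=0.25):
    """Share of (mean-removed) spectral power above `cutoff` cycles per pixel."""
    spec = np.abs(np.fft.fft2(img - img.mean())) ** 2
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    total = spec.sum()
    if total == 0:
        return 0.0  # constant image: no non-DC energy at all
    return float(spec[np.hypot(fx, fy) > cutoff].sum() / total)

def choose_patch_size(img, sizes=(32, 16, 8), weights=(1 / 3, 1 / 3, 1 / 3)):
    """Map a complexity score in [0, 1] to a patch size (hypothetical rule).

    Higher complexity selects a smaller (finer) patch. `sizes` and `weights`
    are illustrative defaults, not values taken from the paper.
    """
    cues = (edge_density(img), normalized_entropy(img), high_freq_ratio(img))
    score = float(np.clip(sum(w * c for w, c in zip(weights, cues)), 0.0, 1.0))
    idx = min(int(score * len(sizes)), len(sizes) - 1)
    return sizes[idx], score

# Usage: a flat image should get a coarse patch, a noisy one a finer patch.
flat = np.full((64, 64), 0.5)
noisy = np.random.default_rng(0).random((64, 64))
print(choose_patch_size(flat))
print(choose_patch_size(noisy))
```

A hand-tuned rule like this only illustrates the idea of complexity-driven granularity; the abstract indicates that the actual balance between global and local processing is governed by learnable parameters trained end-to-end.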

Subject: Computer Vision and Pattern Recognition

Publish: 2025-11-24 11:55:22 UTC

2511.19021
