Rethinking Vision Transformer Depth via Structural Reparameterization

Authors: Chengwei Zhou, Vipin Chaudhary, Gourav Datta

The computational overhead of Vision Transformers in practice stems fundamentally from their deep architectures, yet existing acceleration strategies have primarily targeted algorithmic-level optimizations such as token pruning and attention speedup. This leaves an underexplored research question: can we reduce the number of stacked transformer layers while maintaining comparable representational capacity? To answer this, we propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models suitable for inference deployment. The consolidation mechanism works by gradually merging branches at the entry points of nonlinear components, enabling both feed-forward networks (FFN) and multi-head self-attention (MHSA) modules to undergo exact mathematical reparameterization without inducing approximation errors at test time. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K. The resulting compressed models achieve inference speedups of up to 37% on mobile CPU platforms. Our findings suggest that the conventional wisdom favoring extremely deep transformer stacks may be unnecessarily restrictive, and point toward new opportunities for constructing efficient vision transformers.
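
The abstract's claim of exact reparameterization rests on a general linear-algebra fact: parallel linear branches summed at the input of a nonlinearity can be folded into one linear map without changing the output. The sketch below illustrates only that identity, in plain Python on toy sizes. It is not the authors' implementation. The function names, the two-branch layout, and the use of ReLU are assumptions made for illustration, and the gradual merging schedule and the MHSA case described in the abstract are not modeled.

```python
import random

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add_vec(a, b):
    return [u + v for u, v in zip(a, b)]

def add_mat(A, B):
    return [add_vec(ra, rb) for ra, rb in zip(A, B)]

def relu(v):
    return [max(0.0, t) for t in v]

def two_branch_forward(x, W1, b1, W2, b2):
    # Hypothetical training-time form: two parallel linear branches
    # whose outputs are summed before entering the nonlinearity.
    branch1 = add_vec(matvec(W1, x), b1)
    branch2 = add_vec(matvec(W2, x), b2)
    return relu(add_vec(branch1, branch2))

def merge_branches(W1, b1, W2, b2):
    # Fold both branches into one linear map:
    # W1 x + b1 + W2 x + b2 = (W1 + W2) x + (b1 + b2)
    return add_mat(W1, W2), add_vec(b1, b2)

def single_path_forward(x, W, b):
    # Inference-time form: one linear map followed by the same nonlinearity.
    return relu(add_vec(matvec(W, x), b))

rng = random.Random(0)
d_in, d_out = 4, 3  # toy dimensions, chosen arbitrarily

def rand_mat():
    return [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]

def rand_vec(n):
    return [rng.uniform(-1, 1) for _ in range(n)]

W1, W2 = rand_mat(), rand_mat()
b1, b2 = rand_vec(d_out), rand_vec(d_out)
x = rand_vec(d_in)

W, b = merge_branches(W1, b1, W2, b2)
y_branched = two_branch_forward(x, W1, b1, W2, b2)
y_merged = single_path_forward(x, W, b)

# The two outputs agree up to floating-point rounding.
max_diff = max(abs(u - v) for u, v in zip(y_branched, y_merged))
print(max_diff < 1e-12)
```

The merge is only valid because the branches are summed before the nonlinearity. Summing after it would not collapse in general, which is consistent with the abstract's emphasis on merging at the entry points of nonlinear components.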

Subject: Computer Vision and Pattern Recognition

Publish: 2025-11-24 21:28:55 UTC

2511.19718
