Native Segmentation Vision Transformers

Authors: Guillem Brasó, Aljoša Ošep, Laura Leal-Taixé

Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages yields a hierarchical segmentation that arises *natively* in the feature extraction process, an architecture we term the Native Segmentation Vision Transformer. We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of *native*, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks.
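To make the idea of content-aware grouping more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a soft assignment computed from dot-product similarity between tokens and group seeds, with the seeds taken by strided sampling (standing in for uniform downsampling) over a flattened 1D token sequence. The names `group_tokens`, `compose`, and `temperature` are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def group_tokens(tokens, num_groups, temperature=1.0):
    """Softly assign N tokens to num_groups groups (illustrative only).

    Returns (groups, assign), where assign[i][g] is the weight with which
    token i belongs to group g, and groups[g] is the weighted mean of tokens.
    """
    n, dim = len(tokens), len(tokens[0])
    assert 1 <= num_groups <= n
    # Assumption: seeds come from strided sampling of the input tokens.
    stride = n // num_groups
    seeds = [tokens[g * stride] for g in range(num_groups)]
    # Each token distributes its mass over groups by feature similarity.
    assign = []
    for t in tokens:
        logits = [sum(a * b for a, b in zip(t, s)) / temperature for s in seeds]
        assign.append(softmax(logits))
    # Each group token is the assignment-weighted mean of the input tokens.
    groups = []
    for g in range(num_groups):
        weight = sum(assign[i][g] for i in range(n))
        groups.append(
            [sum(assign[i][g] * tokens[i][d] for i in range(n)) / weight
             for d in range(dim)]
        )
    return groups, assign


def compose(a1, a2):
    """Chain two stages of assignments: (N x M1) and (M1 x M2) -> (N x M2)."""
    return [
        [sum(row[k] * a2[k][j] for k in range(len(a2))) for j in range(len(a2[0]))]
        for row in a1
    ]


# Toy input: four 2-D tokens forming two obvious clusters.
tokens = [[5.0, 0.0], [5.0, 0.1], [0.0, 5.0], [0.1, 5.0]]
groups, assign = group_tokens(tokens, num_groups=2)
# A hard segmentation falls out of the assignment by taking the argmax.
labels = [row.index(max(row)) for row in assign]
```

In this sketch the assignment matrix itself plays the role of the segmentation mask, and `compose` shows how assignments from consecutive stages could be chained so that every input token maps to a group at the coarsest level. How the actual architecture computes, refines, and trains its assignments is not specified in the abstract.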

Subject: NeurIPS.2025 - Poster

V7RRnsAlbY@OpenReview
