Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Authors: Philippe Hansen-Estruch, David Yan, Ching-Yao Chuang, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen

Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. However, questions remain about how auto-encoder design impacts reconstruction and downstream generative performance. This work explores scaling in auto-encoders for reconstruction and generation by replacing the convolutional backbone with an enhanced Vision Transformer for Tokenization (ViTok). We find scaling the auto-encoder bottleneck correlates with reconstruction but exhibits a nuanced relationship with generation. Separately, encoder scaling yields no gains, while decoder scaling improves reconstruction with minimal impact on generation. As a result, we determine that scaling the current paradigm of auto-encoders is not effective for improving generation performance. Coupled with Diffusion Transformers, ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K. In videos, ViTok achieves SOTA reconstruction and generation performance on 16-frame 128p UCF-101.

Subject: ICML.2025 - Poster

OpenReview: MumOAOs9HY
