In this work, we present a novel and largely underexplored direction: building an image tokenizer directly on top of a frozen vision foundation model. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, \textbf{\ours}, achieves substantial improvements in image reconstruction and generation quality while also enhancing token efficiency. It further boosts autoregressive (AR) generation, achieving a gFID of \textbf{1.36} on the ImageNet benchmark, accelerating model convergence \textbf{threefold}, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code is available at \href{https://github.com/CVMI-Lab/VFMTok}{https://github.com/CVMI-Lab/VFMTok}.
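To make the two components concrete, the following is a minimal PyTorch sketch of the training signal described above. All names are hypothetical and this is not the official \ours implementation: the region-adaptive quantizer is simplified to plain nearest-neighbor vector quantization over grid features, and the semantic reconstruction objective is approximated as a mean-squared error between a prediction head's output and the frozen encoder's features.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchTokenizer(nn.Module):
    """Hypothetical sketch: frozen VFM encoder + VQ + dual reconstruction."""

    def __init__(self, encoder: nn.Module, dim: int = 768,
                 codebook_size: int = 8192, patch_pixels: int = 3 * 16 * 16):
        super().__init__()
        self.encoder = encoder.eval()          # frozen vision foundation model
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.Sequential(          # stand-in pixel decoder
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, patch_pixels))
        self.sem_head = nn.Linear(dim, dim)    # semantic reconstruction head

    def quantize(self, z):
        # Plain nearest-neighbor VQ; the region-adaptive scheme in the paper
        # would instead pool features over irregular regions before
        # quantizing, reducing redundancy across the regular 2D grid.
        dists = torch.cdist(z, self.codebook.weight)       # (P, K)
        idx = dists.argmin(dim=-1)
        e = self.codebook(idx)
        vq_loss = F.mse_loss(e, z.detach())  # pull codebook toward frozen feats
        z_q = z + (e - z).detach()           # straight-through estimator
        return z_q, idx, vq_loss

    def forward(self, images, target_patches):
        with torch.no_grad():
            feats = self.encoder(images)     # (B, N, dim) patch features
        B, N, D = feats.shape
        z_q, idx, vq_loss = self.quantize(feats.reshape(B * N, D))
        z_q = z_q.reshape(B, N, D)
        pixels = self.decoder(z_q)           # per-patch pixel reconstruction
        semantic = self.sem_head(z_q)        # aligned with encoder features
        loss = (F.mse_loss(pixels, target_patches)
                + F.mse_loss(semantic, feats)  # semantic reconstruction term
                + vq_loss)
        return loss, idx.reshape(B, N)
\end{verbatim}

In this sketch, the semantic term keeps the discrete tokens faithful to the foundation model's representations, while the pixel term preserves reconstruction quality; a downstream AR model would then be trained over the returned token indices. The specific module shapes and loss weighting here are illustrative assumptions, not those of the released code.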