wq5G71w7Zx@OpenReview

Sparse Image Synthesis via Joint Latent and RoI Flow

Authors: Ziteng Gao, Jay Zhangjie Wu, Mike Zheng Shou

Natural images often exhibit underlying sparse structure, with information density varying significantly across spatial locations. However, most generative models rely on dense, grid-based pixels or latents, neglecting this inherent sparsity. In this paper, we explore a visual generation paradigm built on sparse, non-grid latent representations. Specifically, we design a sparse autoencoder that represents an image as a small number of latents together with their positional properties (i.e., regions of interest, RoIs) while maintaining high reconstruction quality. We then explore training flow-matching transformers jointly on the non-grid latents and their RoI values. To the best of our knowledge, we are the first to address spatial sparsity using RoIs in the generative process. Experimental results show that our sparse flow-based transformers achieve performance competitive with dense grid-based counterparts at significantly lower compute, reaching an FID of 2.76 with just 64 latents on class-conditional ImageNet $256\times 256$ generation.
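The abstract does not spell out the training objective, so the following is only a minimal sketch of what "flow matching jointly on latents and RoI values" could look like, not the authors' implementation. It assumes the common linear (rectified-flow) noise-to-data path, assumes each latent and its RoI box (taken here as normalized `(x0, y0, x1, y1)`) are concatenated into one token state, and uses a hypothetical `velocity_fn` as a stand-in for the transformer. All function names are illustrative.

```python
import numpy as np

def joint_state(latents, rois):
    """Concatenate each sparse latent with its RoI box into one token state.

    latents: (N, K, D) array, K sparse latents per image (assumed layout).
    rois:    (N, K, 4) array, one box per latent, assumed (x0, y0, x1, y1).
    """
    return np.concatenate([latents, rois], axis=-1)

def split_state(x, latent_dim):
    """Inverse of joint_state: recover the latents and the RoI boxes."""
    return x[..., :latent_dim], x[..., latent_dim:]

def make_training_pair(x1, rng):
    """Sample a point on the linear noise-to-data path and its velocity target."""
    x0 = rng.standard_normal(x1.shape)          # Gaussian noise sample
    t = rng.uniform(size=(x1.shape[0], 1, 1))   # one time value per image
    xt = (1.0 - t) * x0 + t * x1                # interpolate noise -> data
    target = x1 - x0                            # constant velocity of that path
    return xt, t, target

def flow_matching_loss(velocity_fn, latents, rois, rng):
    """MSE between predicted and target velocity on the joint (latent, RoI) state."""
    x1 = joint_state(latents, rois)
    xt, t, target = make_training_pair(x1, rng)
    pred = velocity_fn(xt, t)                   # hypothetical model call
    return float(np.mean((pred - target) ** 2))

def euler_sample(velocity_fn, shape, latent_dim, rng, steps=50):
    """Integrate dx/dt = v(x, t) from noise (t=0) to data (t=1), then split."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((shape[0], 1, 1), i * dt)
        x = x + dt * velocity_fn(x, t)          # explicit Euler step
    return split_state(x, latent_dim)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.standard_normal((2, 64, 16))  # 2 images, 64 latents, 16 channels (illustrative)
    rois = rng.uniform(size=(2, 64, 4))
    untrained = lambda xt, t: np.zeros_like(xt) # placeholder for the transformer
    print(flow_matching_loss(untrained, latents, rois, rng))
```

Because latents and boxes share one state vector, a single velocity prediction moves both content and position at every step, which is one plausible reading of "joint" flow; a real model would likely need separate output heads or normalization for the RoI coordinates.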

Subject: NeurIPS.2025 - Poster