
#1 Latent Swap Joint Diffusion for 2D Long-Form Latent Generation

Authors: Yusheng Dai, Chenxi Wang, Chang Li, Chen Wang, Kewei Li, Jun Du, Lei Sun, Jianqing Gao, Ruoyu Wang, Jiefeng Ma

This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method for generating seamless, coherent long spectra and panoramas via a latent swap joint diffusion process across multiple views. We first investigate the spectrum aliasing problem that existing joint diffusion methods cause in spectrum-based audio generation. Through a comparative analysis of the VAE latent representations of spectra and RGB images, we identify that the failure arises from excessive suppression of high-frequency components by the step-wise averaging operator. To address this issue, we propose Self-Loop Latent Swap, a frame-level bidirectional swap operator applied to the overlapping region of adjacent views. By leveraging step-wise differentiated trajectories, this swap operator avoids spectrum distortion and adaptively enhances high-frequency components. Furthermore, to improve global cross-view consistency in non-overlapping regions, we introduce Reference-Guided Latent Swap, a unidirectional latent swap operator that provides a centralized reference trajectory to synchronize subview diffusions. By refining swap timing and intervals, we achieve a balance between cross-view similarity and diversity in a feed-forward manner. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training-based methods in audio generation, using both U-Net and DiT models. It also adapts well to panorama generation, achieving comparable performance with a 2× to 20× speedup. The project website is available at https://swapforward.github.io.
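To make the overlap operator concrete, below is a minimal sketch of a frame-level bidirectional swap between two adjacent view latents, in contrast to step-wise averaging. The function name, tensor shapes, and the even/odd interleaving pattern are illustrative assumptions, not the authors' implementation; the paper additionally varies swap timing and direction across diffusion steps.

```python
import numpy as np

def self_loop_latent_swap(left, right, overlap):
    """Hypothetical frame-level bidirectional swap on the overlap of two
    adjacent view latents (assumed shape [C, H, W], overlap along W).

    Instead of averaging the overlapping latents (which suppresses
    high-frequency components), alternating frames are exchanged between
    the two trajectories, so each trajectory keeps half of its own frames.
    """
    left, right = left.copy(), right.copy()
    l_ov = left[..., -overlap:]   # tail of the left view
    r_ov = right[..., :overlap]   # head of the right view
    swapped_l, swapped_r = l_ov.copy(), r_ov.copy()
    # Odd-indexed frames of the left overlap come from the right trajectory,
    # even-indexed frames of the right overlap come from the left trajectory.
    swapped_l[..., 1::2] = r_ov[..., 1::2]
    swapped_r[..., 0::2] = l_ov[..., 0::2]
    left[..., -overlap:] = swapped_l
    right[..., :overlap] = swapped_r
    return left, right
```

With this interleaving, the two overlaps end up sharing the same frames (even frames from the left trajectory, odd from the right), enforcing consistency without the low-pass effect of averaging; non-overlapping regions are untouched.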

Subject: ICCV.2025 - Poster