
#1 From Slow Bidirectional to Fast Autoregressive Video Diffusion Models

Authors: Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, Xun Huang

Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies: generating a single frame requires the model to process the entire sequence, including future frames. We address these limitations by introducing an autoregressive diffusion transformer adapted from a pretrained bidirectional video diffusion model. Our key innovations are twofold. First, we extend distribution matching distillation (DMD) to videos, compressing a 50-step denoising process into just 4 steps. Second, we develop an asymmetric distillation approach in which a causal student model learns from a bidirectional teacher with privileged future information. This strategy effectively mitigates error accumulation in autoregressive generation, enabling high-quality long-form video synthesis despite training on short clips. Our model achieves a total score of 82.85 on VBench-Long, outperforming all published approaches and, most importantly, uniquely enabling fast streaming inference on a single GPU at 9.4 FPS. Our method also supports streaming video editing, image-to-video generation, and dynamic prompting in a zero-shot manner. We will release code based on an open-source model in the future.
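The asymmetric distillation objective sketched below is a minimal, hypothetical reconstruction of the kind of training step the abstract describes, not the authors' released code. All names (`asymmetric_distillation_step`, `add_noise`, the `student`/`teacher`/`fake_critic` modules and their call signatures) are assumptions; the only grounded ideas are that the few-step causal student is trained with a DMD-style distribution-matching loss against a frozen bidirectional teacher that sees the full clip, including future frames.

```python
# Hypothetical sketch of one asymmetric video-DMD training step.
# Assumptions: student is a causal video transformer (each frame attends
# only to the past), teacher is a frozen bidirectional transformer, and
# fake_critic is a score model tracking the student's output distribution,
# as in standard DMD. Shapes and the noise schedule are illustrative.
import torch
import torch.nn.functional as F


def add_noise(x, noise, t, num_steps=1000):
    # Simple linear-alpha forward process; the real scheduler may differ.
    alpha = 1.0 - t.float() / num_steps
    alpha = alpha.view(-1, 1, 1, 1, 1)
    return alpha.sqrt() * x + (1.0 - alpha).sqrt() * noise


def asymmetric_distillation_step(student, teacher, fake_critic,
                                 noisy_latents, t, text_emb):
    """One gradient step: push the causal student's samples toward the
    distribution of the bidirectional teacher (difference of scores)."""
    # Few-step student prediction using causal attention only.
    x_student = student(noisy_latents, t, text_emb)   # (B, T, C, H, W)

    # Re-noise the student's output at a fresh timestep for score matching.
    t2 = torch.randint(0, 1000, (x_student.size(0),), device=x_student.device)
    noise = torch.randn_like(x_student)
    x_renoised = add_noise(x_student, noise, t2)

    with torch.no_grad():
        # Teacher score: bidirectional, sees all frames including the future.
        score_real = teacher(x_renoised, t2, text_emb)
        # Critic score: approximates the student's current distribution.
        score_fake = fake_critic(x_renoised, t2, text_emb)

    # DMD surrogate loss: its gradient w.r.t. x_student equals
    # (score_fake - score_real), the distribution-matching direction.
    grad = score_fake - score_real
    return 0.5 * F.mse_loss(x_student, (x_student - grad).detach())
```

In this reading, the asymmetry is purely in the attention masks: the student must commit to each frame from past context alone (which is what permits streaming inference), while the teacher's privileged future view supplies a training signal that counteracts the error accumulation typical of autoregressive rollout.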

Subject: CVPR.2025 - Poster