AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation


Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov

We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally-aligned self-attention operations. Unlike prior work that uses dedicated models for A2V and V2A tasks and relies on pretrained feature extractors, AV-Link achieves both tasks in a single self-contained framework, directly leveraging features from the complementary modality (i.e., video features to generate audio, or audio features to generate video). Extensive evaluations demonstrate that AV-Link achieves substantial improvements in audio-video synchronization, outperforming more expensive baselines such as the MovieGen V2A model.
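The abstract does not give the internals of the Fusion Block, so the following is only a rough sketch of what a temporally-aligned self-attention over two modalities could look like, not the authors' implementation. It assumes that video and audio features are tagged with a sinusoidal encoding of their timestamp in seconds, concatenated, and passed through one single-head self-attention layer with random (untrained) projections. The names `temporal_encoding` and `fuse`, the choice of encoding, and all shapes are hypothetical.

```python
import numpy as np

def temporal_encoding(timestamps, dim):
    """Sinusoidal encoding of timestamps in seconds (an illustrative assumption)."""
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    angles = timestamps[:, None] * freqs[None, :]
    enc = np.zeros((len(timestamps), dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(video_feats, audio_feats, duration, rng):
    """Hypothetical fusion step: joint self-attention over both token sets.

    video_feats: (n_video, dim) features, e.g. activations of a video model
    audio_feats: (n_audio, dim) features, e.g. activations of an audio model
    duration:    clip length in seconds, shared by both modalities
    """
    dim = video_feats.shape[1]  # assumed even and equal for both modalities
    n_video = len(video_feats)

    # Tokens at the same moment get the same encoding, even though the two
    # modalities are sampled at different rates.
    t_video = np.linspace(0.0, duration, n_video, endpoint=False)
    t_audio = np.linspace(0.0, duration, len(audio_feats), endpoint=False)
    tokens = np.concatenate([
        video_feats + temporal_encoding(t_video, dim),
        audio_feats + temporal_encoding(t_audio, dim),
    ], axis=0)

    # Random projections stand in for learned weights.
    w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v

    # Every token attends to every token of both modalities, so information
    # flows video-to-audio and audio-to-video in one operation.
    attn = softmax(q @ k.T / np.sqrt(dim))
    fused = attn @ v
    return fused[:n_video], fused[n_video:], attn

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.standard_normal((8, 16))   # 8 video tokens
    audio = rng.standard_normal((20, 16))  # 20 audio tokens
    fused_video, fused_audio, attn = fuse(video, audio, duration=2.0, rng=rng)
    print(fused_video.shape, fused_audio.shape, attn.shape)
```

In this sketch the fused video tokens could condition an audio generator and the fused audio tokens a video generator, which mirrors the bidirectional exchange described in the abstract. How the actual framework injects these features into the frozen diffusion models is not stated here.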

Subject: ICCV.2025 - Poster

