Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition

Authors: Juncheng Wang, Chao Xu, Cheng Yu, Lei Shang, Zhe Hu, Shujun Wang, Liefeng Bo

Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control mature text-to-audio generative diffusion models, this paper presents how to balance the representation of mel-spectrograms in terms of completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals and apply quantization or continuity to each, so that they can be effectively predicted from video by a devised video-to-all (V2X) predictor. The predicted signals are then recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics, evaluating dimensions such as quality, synchronization, and semantic consistency.
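
The abstract does not say which three signals the mel-spectrogram is split into, so the following is only an illustrative sketch of the general idea of decomposing a spectrogram, quantizing one component, keeping the others continuous, and recomposing. The particular split used here (per-frame mean, per-frame spread, and a quantized normalized shape), the function names `decompose` and `recompose`, and the parameters `levels` and `clip` are all assumptions made for illustration, not the paper's method.

```python
from statistics import mean, pstdev


def decompose(mel, levels=16, clip=3.0):
    """Split a mel-spectrogram (list of frames, each a list of bin values)
    into three hypothetical signals: per-frame mean and per-frame spread
    (kept continuous) and a normalized shape quantized to integer codes."""
    step = 2 * clip / (levels - 1)
    means, stds, codes = [], [], []
    for frame in mel:
        mu = mean(frame)
        sigma = pstdev(frame) or 1.0  # flat frame: avoid dividing by zero
        frame_codes = []
        for value in frame:
            # Normalize, clip to [-clip, clip], then snap to the nearest level.
            z = max(-clip, min(clip, (value - mu) / sigma))
            frame_codes.append(round((z + clip) / step))
        means.append(mu)
        stds.append(sigma)
        codes.append(frame_codes)
    return means, stds, codes


def recompose(means, stds, codes, levels=16, clip=3.0):
    """Rebuild an approximate mel-spectrogram from the three signals."""
    step = 2 * clip / (levels - 1)
    return [
        [mu + sigma * (code * step - clip) for code in frame_codes]
        for mu, sigma, frame_codes in zip(means, stds, codes)
    ]
```

In this toy version, fewer quantization levels make the discrete signal simpler to predict but increase the reconstruction error, which is one way to picture the completeness versus complexity trade-off the abstract mentions.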

Subject: CVPR.2025 - Poster

Wang_Synchronized_Video-to-Audio_Generation_via_Mel_Quantization-Continuum_Decomposition@CVPR2025@CVF
