Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries

Authors: Serkan Sulun, Paula Viana, Matthew E. P. Davies

We introduce EMSYNC, a video-based symbolic music generation model that aligns music with a video's emotional content and temporal boundaries. It follows a two-stage framework in which a pretrained video emotion classifier extracts emotional features and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate scene cuts and align musical chords with them. Unlike existing models, our approach retains event-based encoding, ensuring fine-grained timing control and expressive musical nuances. We also propose a mapping scheme to bridge the video emotion classifier, which produces discrete emotion categories, with the emotion-conditioned MIDI generator, which operates on continuous-valued valence-arousal inputs. In subjective listening tests, EMSYNC outperforms state-of-the-art models across all metrics, for both music theory-aware participants and general listeners.
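
The two conditioning signals described in the abstract can be illustrated with a minimal Python sketch. It is not EMSYNC's implementation: the category names, the valence-arousal coordinates in `EMOTION_TO_VA`, the probability-weighted averaging in `map_emotion_to_va`, and the definition of a boundary offset as the time remaining until the next scene cut are all assumptions made here for illustration.

```python
from bisect import bisect_left

# Hypothetical (valence, arousal) coordinates in [-1, 1].
# Illustrative values only; they are not taken from the paper.
EMOTION_TO_VA = {
    "joy": (0.8, 0.6),
    "anger": (-0.7, 0.8),
    "sadness": (-0.7, -0.5),
    "calm": (0.6, -0.6),
}


def map_emotion_to_va(probs):
    """Map classifier probabilities over discrete emotions to one
    continuous (valence, arousal) point via a weighted average.
    The averaging rule is an assumption, not the paper's scheme."""
    total = sum(probs.values())
    valence = sum(p * EMOTION_TO_VA[e][0] for e, p in probs.items()) / total
    arousal = sum(p * EMOTION_TO_VA[e][1] for e, p in probs.items()) / total
    return valence, arousal


def boundary_offsets(event_times, cut_times):
    """For each event time (seconds), return the time remaining until
    the next scene cut at or after it, or None if no cut remains."""
    cuts = sorted(cut_times)
    offsets = []
    for t in event_times:
        i = bisect_left(cuts, t)  # index of first cut >= t
        offsets.append(cuts[i] - t if i < len(cuts) else None)
    return offsets


if __name__ == "__main__":
    print(map_emotion_to_va({"joy": 0.5, "sadness": 0.5}))
    print(boundary_offsets([0.0, 1.0, 2.5, 4.0, 5.0], [2.0, 4.0]))
```

Under these assumptions, a generator conditioned on the offset sees a countdown that reaches zero exactly at a cut, which is one way a model could learn to place a chord on the boundary rather than react to it afterwards.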

Subjects: Sound, Artificial Intelligence, Machine Learning, Multimedia, Audio and Speech Processing, Image and Video Processing

Published: 2025-02-14 13:32:59 UTC

arXiv: 2502.10154
