Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Authors: Jiajun Li, Tianze Xu, Xuesong Chen, Xinrui Yao, Shuchang Liu

In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the creation of music, images, and other artistic forms across a wide range of industries. However, current models for image- and video-to-music synthesis struggle to capture the nuanced emotions and atmosphere conveyed by visual content. To fill this gap, we propose Mozart's Touch, a multi-modal music generation framework capable of generating music aligned with cross-modal inputs such as images, videos, and text. The framework consists of three key components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional end-to-end methods, Mozart's Touch uses LLMs to accurately interpret visual elements without requiring the training or fine-tuning of music generation models, providing efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation challenges between descriptive texts from different modalities. Through a series of objective and subjective evaluations, we demonstrate that Mozart's Touch outperforms current state-of-the-art models. Our code and examples are available at https://github.com/TiffanyBlews/MozartsTouch.
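The three-component design described in the abstract can be pictured as a simple chain: caption the input, rewrite the captions into a music-oriented prompt, then pass that prompt to a pre-trained music generator. The following is a minimal sketch of that flow only, not the authors' implementation. Every name in it (`Pipeline`, `stub_captioner`, `stub_bridge`, `stub_generator`) is a hypothetical placeholder, and the stubs return dummy values instead of calling real captioning, LLM, or music models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type aliases for the three stages; not the project's API.
Captioner = Callable[[str], List[str]]   # media path -> descriptive captions
Bridge = Callable[[List[str]], str]      # captions -> music-oriented prompt
MusicGenerator = Callable[[str], bytes]  # prompt -> audio bytes


@dataclass
class Pipeline:
    """Chains the three stages; no stage is trained or fine-tuned here."""
    caption: Captioner
    bridge: Bridge
    generate: MusicGenerator

    def run(self, media_path: str) -> Tuple[str, bytes]:
        captions = self.caption(media_path)  # 1. multi-modal captioning
        prompt = self.bridge(captions)       # 2. LLM understanding and bridging
        audio = self.generate(prompt)        # 3. music generation
        return prompt, audio                 # the prompt stays inspectable


# Stub stages (assumptions for illustration; real models would go here).
def stub_captioner(media_path: str) -> List[str]:
    return [f"a caption describing {media_path}"]


def stub_bridge(captions: List[str]) -> str:
    return "Music prompt derived from: " + "; ".join(captions)


def stub_generator(prompt: str) -> bytes:
    return prompt.encode("utf-8")  # stand-in for generated audio


pipeline = Pipeline(stub_captioner, stub_bridge, stub_generator)
prompt, audio = pipeline.run("sunset.jpg")
```

Returning the intermediate prompt alongside the audio reflects the abstract's point about transparency: the text handed to the music model can be read and checked directly.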

Subjects: Sound, Artificial Intelligence, Audio and Speech Processing

Published: 2024-05-05 03:15:52 UTC

arXiv: 2405.02801
