MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls

Authors: Yuxuan Bian, Ailing Zeng, Xuan Ju, Xian Liu, Zhaoyang Zhang, Wei Liu, Qiang Xu

Whole-body multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to process different condition modalities presents two main challenges: motion distribution drifts across different tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy, starting with text-to-motion semantic pre-training, followed by multimodal low-level control adaptation. To effectively learn and transfer motion knowledge across different distributions, we design MC-Attn for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that MotionCraft achieves state-of-the-art performance on various standard motion generation tasks.

Subject: AAAI.2025 - Computer Vision

32183@AAAI
