8UFG9D8xeU@OpenReview

#1 Direct Post-Training Preference Alignment of Multi-Agent Motion Generation Model with Implicit Feedback from Demonstrations

Authors: Thomas Tian, Kratarth Goel

Recent advancements in Large Language Models (LLMs) have transformed motion generation models in embodied applications such as autonomous driving and robotic manipulation. While LLM-type motion models benefit from scalability and efficient formulation, there remains a discrepancy between their token-prediction imitation objectives and human preferences. This often results in behaviors that deviate from human-preferred demonstrations, making post-training behavior alignment crucial for generating human-preferred motions. Post-training alignment requires a large number of preference rankings over model generations, which are costly and time-consuming to annotate in multi-agent motion generation settings. Recently, there has been growing interest in using expert demonstrations to scalably build preference data for alignment. However, these methods often adopt a worst-case scenario assumption, treating all generated samples from the reference model as unpreferred and relying on expert demonstrations to directly or indirectly construct preferred generations. This approach overlooks the rich signal provided by preference rankings among the model's own generations. In this work, instead of treating all generated samples as equally unpreferred, we propose a principled approach that leverages the implicit preferences encoded in expert demonstrations to construct preference rankings among the generations produced by the reference model, offering more nuanced guidance at low cost. We present the first investigation of direct preference alignment for multi-agent motion token-prediction models using implicit preference feedback from demonstrations. We apply our approach to large-scale traffic simulation and demonstrate its effectiveness in improving the realism of generated behaviors involving up to 128 agents, making a 1M token-prediction model comparable to state-of-the-art large models by relying solely on implicit feedback from demonstrations, without requiring additional human annotations or high computational costs. Furthermore, we provide an in-depth analysis of preference data scaling laws and their effects on over-optimization, offering valuable insights for future investigations.
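Below is a minimal sketch of the pipeline the abstract describes, under assumptions the abstract does not pin down: candidate multi-agent rollouts sampled from the reference model are ranked by a simple displacement distance (ADE) to the expert demonstration, and the resulting (preferred, unpreferred) pair is scored with a standard DPO-style objective. The function names, the ADE scoring rule, and the particular direct-alignment loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rank_generations_by_demo_distance(generations, demonstration):
    """Rank sampled rollouts by average displacement error (ADE) to the
    expert demonstration (an assumed proxy for the implicit preference).
    generations:   (K, A, T, 2) candidate multi-agent trajectories
    demonstration: (A, T, 2)    expert multi-agent trajectory
    Returns candidate indices sorted from most to least preferred."""
    ade = (generations - demonstration.unsqueeze(0)).norm(dim=-1).mean(dim=(1, 2))  # (K,)
    return torch.argsort(ade)  # lower ADE = more preferred

def dpo_pair_loss(logp_policy_w, logp_policy_l, logp_ref_w, logp_ref_l, beta=0.1):
    """Standard DPO loss on one (preferred w, unpreferred l) pair of sequence
    log-probabilities under the policy being tuned and the frozen reference."""
    logits = beta * ((logp_policy_w - logp_ref_w) - (logp_policy_l - logp_ref_l))
    return -F.logsigmoid(logits).mean()

# Toy usage with random tensors standing in for model samples.
K, A, T = 4, 8, 20                      # candidates, agents, timesteps
gens = torch.randn(K, A, T, 2)
demo = torch.randn(A, T, 2)
order = rank_generations_by_demo_distance(gens, demo)
w, l = order[0], order[-1]              # most vs. least preferred candidate
# Sequence log-probs would come from the motion token-prediction model;
# placeholder scalars are used here.
loss = dpo_pair_loss(torch.tensor(-10.0), torch.tensor(-12.0),
                     torch.tensor(-11.0), torch.tensor(-11.5))
```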

Subject: ICLR.2025 - Spotlight