Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement

Authors: Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang

Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
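The refinement loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names (`generate`, `critique`, `rewrite`), the `Round` record, the fixed round budget, and the stop-when-no-issues rule are all assumptions made for illustration. The three callables stand in for a video generator, a VLM that reports physical inconsistencies, and an LLM that rewrites the prompt from that feedback.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Round:
    """One iteration of the loop (hypothetical record type)."""
    prompt: str
    video: Any
    issues: List[str]


def refine_loop(
    prompt: str,
    generate: Callable[[str], Any],             # hypothetical: prompt -> video
    critique: Callable[[Any, str], List[str]],  # hypothetical: VLM -> list of physics issues
    rewrite: Callable[[str, List[str]], str],   # hypothetical: LLM -> refined prompt
    max_rounds: int = 3,                        # assumed budget, not from the paper
) -> List[Round]:
    """Generate, critique, and rewrite the prompt until no issues remain
    or the round budget is exhausted. No model weights are updated."""
    history: List[Round] = []
    for _ in range(max_rounds):
        video = generate(prompt)
        issues = critique(video, prompt)
        history.append(Round(prompt, video, issues))
        if not issues:
            break  # assumed stopping rule: the critic reports nothing
        prompt = rewrite(prompt, issues)
    return history
```

Because the loop only touches prompts and treats the generator as a black box, it is consistent with the training-free, plug-and-play property the abstract claims; any video model that accepts a text prompt could be passed in as `generate`.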

Subject: Computer Vision and Pattern Recognition

Publish: 2025-11-25 13:09:03 UTC

arXiv: 2511.20280