Automating the expert-dependent and labor-intensive task of 3D scene synthesis would significantly benefit fields such as architectural design, robotics simulation, and virtual reality. Recent approaches to 3D scene synthesis often rely on the commonsense reasoning of large language models (LLMs) or on strong visual priors from image generation models. However, current LLMs exhibit limited 3D spatial reasoning, undermining the realism and global coherence of synthesized scenes, while image-generation-based methods offer limited viewpoint control and introduce multi-view inconsistencies. In this work, we present Video Perception models for 3D Scene synthesis (VIPScene), a novel framework that exploits the commonsense knowledge of the 3D physical world encoded in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to analyze each object in a scene both semantically and geometrically. This enables flexible scene synthesis with high realism and structural consistency. For a more thorough evaluation of coherence and plausibility, we further introduce the First-Person View Score (FPVScore), which leverages a continuous first-person perspective to capitalize on the reasoning ability of multimodal large language models. Extensive experiments show that VIPScene significantly outperforms existing methods and generalizes well across diverse scenarios.
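To make the FPVScore idea concrete, the following is a minimal sketch, not the paper's released implementation: it assumes hypothetical helpers render_first_person_frames (which renders frames along a continuous first-person camera path through a synthesized scene) and query_mllm (which returns a numeric rating from a multimodal LLM); both names and the prompt wording are illustrative assumptions.

```python
from typing import Callable, List, Sequence


def fpv_score(
    scene: object,
    render_first_person_frames: Callable[[object, int], Sequence[object]],
    query_mllm: Callable[[Sequence[object], str], float],
    num_frames: int = 16,
) -> float:
    """Hypothetical FPVScore sketch: rate the coherence and plausibility of a
    synthesized 3D scene from a continuous first-person walkthrough."""
    # Render frames along a continuous first-person camera trajectory.
    frames: List[object] = list(render_first_person_frames(scene, num_frames))

    # Judge the walkthrough as a whole, so that cross-view consistency
    # (not just per-frame image quality) enters the evaluation.
    prompt = (
        "These frames form a continuous first-person walkthrough of an indoor "
        "scene. Rate the layout coherence and physical plausibility on a "
        "scale of 1 to 10."
    )
    return query_mllm(frames, prompt)
```

Under these assumptions, scores from multiple walkthroughs or multiple scenes could simply be averaged to compare methods.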