RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing

Authors: Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li

Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multimodal inputs (textual descriptions or CAD floor plans) into an Indoor Domain-Specific Language (IDSL) for structured indoor scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments validate its strong multimodal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.

Subject: Computer Vision and Pattern Recognition

Published: 2025-12-12 02:33:09 UTC

2512.11234
