3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Authors: Tatiana Zemskova, Dmitry Yudin

A 3D scene graph represents a compact scene model by capturing both the objects present and the semantic relationships between them, making it a promising structure for robotic applications. To interact effectively with users, an embodied intelligent agent should be able to answer a wide range of natural language queries about the surrounding 3D environment. Large Language Models (LLMs) are a promising solution for user-robot interaction due to their natural language understanding and reasoning abilities. Recent work on learning scene representations has shown that adapting these representations to the 3D world can significantly improve the quality of LLM responses. However, existing methods typically rely only on geometric information, such as object coordinates, and overlook the rich semantic relationships between objects. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph that explicitly incorporates semantic relationships. This representation is used as input to LLMs for performing 3D vision-language tasks. In our experiments on the popular ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap benchmarks, we demonstrate that our approach outperforms baselines that do not leverage semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.
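To make the core idea concrete, below is a minimal sketch (not the paper's implementation; consult the linked repository for the actual method) of how a 3D scene graph with objects and semantic relations might be serialized into an LLM prompt segment. All class names, field names, and the predicate vocabulary here are hypothetical illustrations; 3DGraphLLM itself learns embedding-based representations rather than plain text.

```python
# Hypothetical sketch: flattening a 3D scene graph (objects + semantic
# relations) into text for an LLM. Names and format are illustrative only.
from dataclasses import dataclass

@dataclass
class SceneObject:
    obj_id: int
    label: str                           # e.g. "chair"
    center: tuple                        # (x, y, z) centroid in scene coordinates

@dataclass
class Relation:
    subj_id: int
    pred: str                            # semantic predicate, e.g. "in front of"
    obj_id: int

def serialize_scene_graph(objects, relations):
    """Flatten objects and their pairwise semantic relations into a
    plain-text prompt segment. A learnable variant (as in the paper)
    would instead map objects and relations to embedding vectors
    prepended to the LLM input."""
    lines = []
    for o in objects:
        x, y, z = o.center
        lines.append(f"<obj{o.obj_id}> {o.label} at ({x:.2f}, {y:.2f}, {z:.2f})")
    for r in relations:
        lines.append(f"<obj{r.subj_id}> {r.pred} <obj{r.obj_id}>")
    return "\n".join(lines)

if __name__ == "__main__":
    objects = [
        SceneObject(0, "sofa", (1.0, 2.0, 0.4)),
        SceneObject(1, "coffee table", (1.2, 1.1, 0.3)),
    ]
    relations = [Relation(1, "in front of", 0)]
    print(serialize_scene_graph(objects, relations))
```

The sketch illustrates the contrast the abstract draws: a geometry-only baseline would emit only the object lines, while the relation lines carry the semantic structure that 3DGraphLLM argues improves LLM responses.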

Subject: ICCV.2025 - Poster