3D vision-language grounding (VLG) faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes, a six-order-of-magnitude gap that severely limits performance. We introduce \textbf{\emph{LIFT-GS}}, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with 25.7\% mAP on open-vocabulary instance segmentation (vs.\ 20.2\% prior SOTA) and consistent 10--30\% improvements on referential grounding tasks. Remarkably, pretraining is effectively equivalent to doubling the fine-tuning dataset, demonstrating strong scaling behavior and suggesting that 3D VLG currently operates in a severely data-scarce regime. Project page: \url{https://liftgs.github.io}.
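
To make the render-supervised formulation concrete, the following is a minimal sketch of the training objective implied by the description above; the symbols ($G_\theta$, $m_\theta$, $\mathcal{R}$, $\pi_v$, $\hat{M}_v$) are illustrative notation introduced here, not taken verbatim from the method:
\begin{equation*}
\mathcal{L}(\theta) \;=\; \sum_{v \in \mathcal{V}} \ell\!\Big(\mathcal{R}\big(G_\theta(P),\, m_\theta(P, q);\, \pi_v\big),\; \hat{M}_v\Big),
\end{equation*}
where $P$ is the input point cloud, $q$ a language query, $G_\theta(P)$ the predicted 3D Gaussians, $m_\theta(P, q)$ the predicted language-conditioned 3D mask, $\mathcal{R}$ a differentiable Gaussian rasterizer under camera $\pi_v$ for view $v \in \mathcal{V}$, and $\hat{M}_v$ a 2D pseudo-mask produced by a 2D foundation model such as SAM. Because $\mathcal{R}$ is differentiable, the 2D loss $\ell$ backpropagates through the complete encoder-decoder without any 3D annotations.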