NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation


Authors: Junyuan Fang, Zihan Wang, Yejun Zhang, Shuzhe Wang, Iaroslav Melekhov, Juho Kannala

Vision-language models (VLMs) have demonstrated impressive zero-shot transfer capabilities in image-level visual perception tasks. However, they fall short in 3D instance-level segmentation tasks that require accurate localization and recognition of individual objects. To bridge this gap, we introduce a novel 3D Gaussian Splatting-based hard visual prompting approach that leverages camera interpolation to generate diverse viewpoints around target objects without any 2D-3D optimization or fine-tuning. Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts by enforcing geometric consistency across viewpoints. This training-free strategy seamlessly integrates with prior hard visual prompts, enriching object-descriptive features and enabling VLMs to achieve more robust and accurate 3D instance segmentation in diverse 3D scenes.
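
The abstract does not say how the camera interpolation is carried out, so the following is only an illustrative sketch of one common way to generate intermediate viewpoints between two known camera poses: spherical linear interpolation (SLERP) of the rotation and linear interpolation of the camera position. The function names `slerp` and `interpolate_poses`, the `(w, x, y, z)` quaternion layout, and the pose representation are assumptions made for this example, not the authors' implementation. Rendering the resulting views with 3D Gaussian Splatting is omitted.

```python
import math


def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    # q and -q encode the same rotation; flip one so we follow the shorter arc.
    if dot < 0.0:
        q1 = tuple(-c for c in q1)
        dot = -dot
    dot = min(dot, 1.0)  # guard against rounding slightly above 1
    theta = math.acos(dot)
    if theta < 1e-6:
        # Nearly identical rotations: plain linear interpolation is stable here.
        q = tuple((1.0 - t) * a + t * b for a, b in zip(q0, q1))
    else:
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def interpolate_poses(pose_a, pose_b, num_views):
    """Return num_views poses strictly between pose_a and pose_b.

    Each pose is a (quaternion, position) pair (hypothetical representation).
    """
    (qa, pa), (qb, pb) = pose_a, pose_b
    poses = []
    for i in range(1, num_views + 1):
        t = i / (num_views + 1)  # evenly spaced, endpoints excluded
        q = slerp(qa, qb, t)
        p = tuple((1.0 - t) * a + t * b for a, b in zip(pa, pb))
        poses.append((q, p))
    return poses
```

For example, calling `interpolate_poses(pose_a, pose_b, 3)` on two neighbouring camera poses that observe the same object would yield three in-between viewpoints from which additional views of that object could be rendered.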

Subject: Computer Vision and Pattern Recognition

Publish: 2025-04-20 14:39:27 UTC

arXiv: 2504.14638
