YGIbwfNWot@OpenReview


#1 Tri-MARF: A Tri-Modal Multi-Agent Responsive Framework for Comprehensive 3D Object Annotation

Authors: Jusheng Zhang, Yijia Fan, Zimo Wen, Jian Wang, Keze Wang

Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation is a critical task that faces challenges absent in 2D annotation, such as spatial complexity, occlusion, and viewpoint inconsistency. Existing methods that rely on a single model often struggle with these issues. In this paper, we introduce Tri-MARF, a novel framework that integrates tri-modal inputs (i.e., 2D multi-view images, text descriptions, and 3D point clouds) with multi-agent collaboration to enhance the 3D annotation process. Our Tri-MARF consists of three specialized agents: a vision-language model agent that generates multi-view descriptions, an information aggregation agent that selects the optimal descriptions, and a gating agent that aligns text descriptions with 3D geometry to produce more refined captions. Extensive experiments on the Objaverse-LVIS, Objaverse-XL, and ABO datasets demonstrate the superiority of Tri-MARF, which achieves a CLIPScore of 88.7 (versus 78.6–82.4 for other state-of-the-art methods), retrieval accuracy of 45.2/43.8 (ViLT R@5), and a throughput of 12,000 objects per hour on a single NVIDIA A100 GPU.
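The three-agent flow described in the abstract can be sketched as a simple pipeline. This is a minimal illustrative sketch, not the authors' implementation: the agent internals (a vision-language model, a description scorer, a geometry-aware gate) are replaced with trivial stand-ins, and all function names and signatures are assumptions made for illustration.

```python
# Hypothetical sketch of a Tri-MARF-style pipeline: three agents run in
# sequence over tri-modal inputs. Agent internals are stubbed out.
from dataclasses import dataclass


@dataclass
class ObjectInputs:
    """Tri-modal inputs for one 3D object (placeholders stand in for real data)."""
    views: list        # multi-view 2D renders
    point_cloud: list  # 3D points


def vlm_agent(views):
    """Agent 1 (stub): produce one caption per rendered view.
    A real system would call a vision-language model here."""
    return [f"caption for {v}" for v in views]


def aggregation_agent(captions):
    """Agent 2 (stub): select the best caption.
    A real system would score candidates; here we just take the longest."""
    return max(captions, key=len)


def gating_agent(caption, point_cloud):
    """Agent 3 (stub): refine the caption against 3D geometry.
    A real system would align text with point-cloud features."""
    return f"{caption} ({len(point_cloud)} points)"


def tri_marf_annotate(obj: ObjectInputs) -> str:
    """Run the three agents in sequence and return the final annotation."""
    captions = vlm_agent(obj.views)
    best = aggregation_agent(captions)
    return gating_agent(best, obj.point_cloud)


obj = ObjectInputs(views=["front", "side", "top"],
                   point_cloud=[(0, 0, 0), (1, 1, 1)])
print(tri_marf_annotate(obj))  # → caption for front (2 points)
```

The sketch only shows the control flow (generate, select, refine); the reported throughput and accuracy come from the actual models behind each agent.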

Subject: NeurIPS.2025 - Poster