Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding

Authors: Guofeng Mei, Bin Ren, Juan Liu, Luigi Riz, Xiaoshui Huang, Xu Zheng, Yongshun Gong, Ming-Hsuan Yang, Nicu Sebe, Fabio Poiesi

Vision-language models like CLIP can offer a promising foundation for 3D scene understanding when extended with 3D tokenizers. However, standard approaches, such as k-nearest neighbor or radius-based tokenization, struggle with cross-domain generalization due to sensitivity to dataset-specific spatial scales. We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. Through extensive experimental analysis, we show that combining superpoint-based grouping with coordinate scale normalization consistently outperforms conventional methods. Specifically, we introduce S4Token, a tokenization pipeline that produces semantically informed tokens regardless of scene scale. Our tokenizer is trained without annotations using masked point modeling and clustering-based objectives, along with cross-modal distillation to align 3D tokens with 2D multi-view image features. For dense prediction tasks, we propose a superpoint-level feature propagation module to recover point-level detail from sparse tokens.
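
The scale sensitivity mentioned in the abstract can be illustrated with a small sketch. This is not the S4Token implementation: the unit-sphere normalization, the fixed grouping radius, and the helper names `normalize_scale` and `radius_group_sizes` are all assumptions chosen for illustration. It only shows why a fixed-radius grouping behaves differently when the same scene is expressed in different units, and how normalizing coordinates removes that dependence.

```python
import numpy as np


def normalize_scale(points: np.ndarray) -> np.ndarray:
    """Center a point cloud and scale it to fit in the unit sphere.

    Assumption: unit-sphere normalization stands in for the paper's
    "coordinate scale normalization", whose exact form is not given here.
    """
    centered = points - points.mean(axis=0, keepdims=True)
    radius = np.linalg.norm(centered, axis=1).max()
    return centered / max(radius, 1e-12)


def radius_group_sizes(points: np.ndarray, centers: np.ndarray, radius: float) -> np.ndarray:
    """Count how many points fall within `radius` of each center."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    return (dists <= radius).sum(axis=0)


rng = np.random.default_rng(0)
scene_m = rng.uniform(-1.0, 1.0, size=(512, 3))  # hypothetical scene in meters
scene_mm = scene_m * 1000.0                      # same scene in millimeters

# Fixed-radius grouping on raw coordinates depends on the unit of measure.
raw_m = radius_group_sizes(scene_m, scene_m[:8], radius=0.3)
raw_mm = radius_group_sizes(scene_mm, scene_mm[:8], radius=0.3)

# After normalization both versions map to (numerically) the same coordinates,
# so any grouping applied afterwards sees the same geometry.
norm_m = normalize_scale(scene_m)
norm_mm = normalize_scale(scene_mm)

print("raw groups (m): ", raw_m)
print("raw groups (mm):", raw_mm)
print("normalized coordinates match:", np.allclose(norm_m, norm_mm))
```

In the millimeter version each center ends up alone in its group, because the radius that was reasonable in meters is far too small; after normalization the two versions coincide.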

Subject: Computer Vision and Pattern Recognition

Publish: 2025-05-24 18:26:30 UTC

2505.18819
