S5coB5kqSD@OpenReview

Total: 1

#1 VeXKD: The Versatile Integration of Cross-Modal Fusion and Knowledge Distillation for 3D Perception [PDF] [Copy] [Kimi1] [REL]

Authors: JI Yuzhe, Yijie CHEN, Liuqing Yang, Rui Ding, Meng Yang, Xinhu Zheng

Recent advancements in 3D perception have led to a proliferation of network architectures, particularly those involving multi-modal fusion algorithms. While these fusion algorithms improve accuracy, their complexity often impedes real-time performance. This paper introduces VeXKD, an effective and Versatile framework that integrates Cross-Modal Fusion with Knowledge Distillation. VeXKD applies knowledge distillation exclusively to the Bird's Eye View (BEV) feature maps, enabling the transfer of cross-modal insights to single-modal students without additional inference time overhead. It avoids volatile components that can vary across various 3D perception tasks and student modalities, thus improving versatility. The framework adopts a modality-general cross-modal fusion module to bridge the modality gap between the multi-modal teachers and single-modal students. Furthermore, leveraging byproducts generated during fusion, our BEV query guided mask generation network identifies crucial spatial locations across different BEV feature maps in a data-driven manner, significantly enhancing the effectiveness of knowledge distillation. Extensive experiments on the nuScenes dataset demonstrate notable improvements, with up to 6.9\%/4.2\% increase in mAP and NDS for 3D detection tasks and up to 4.3\% rise in mIoU for BEV map segmentation tasks, narrowing the performance gap with multi-modal models.

Subject: NeurIPS.2024 - Poster