Wang_VISO_Accelerating_In-orbit_Object_Detection_with_Language-Guided_Mask_Learning_and@ICCV2025@CVF


#1 VISO: Accelerating In-orbit Object Detection with Language-Guided Mask Learning and Sparse Inference

Authors: Meiqi Wang, Han Qiu

In-orbit object detection is essential for Earth observation missions on satellites equipped with GPUs. A promising approach is to use pre-trained vision-language models (VLMs) to give the detector open-vocabulary capability. However, deploying VLMs on satellites poses two challenges: (1) satellite imagery differs substantially from natural images, and (2) satellites' embedded GPUs lack the compute needed for inference with complex models. We find that existing VLMs lack a crucial prior: in-orbit detection involves identifying a set of known objects within a cluttered yet monotonous background. Motivated by this observation, we propose VISO, a Vision-language Instructed Satellite Object detection model that focuses on object-specific features while suppressing irrelevant regions through language-guided mask learning. After pre-training on a large-scale satellite dataset with 3.4M region-text pairs, VISO enhances object-text alignment and object-centric features to improve detection accuracy. VISO also suppresses irrelevant regions, enabling highly sparse inference that speeds up detection on satellites. Extensive experiments show that VISO without sparsity outperforms state-of-the-art (SOTA) VLMs in zero-shot detection, improving AP by 34.1% while using 27x fewer FLOPs, and surpasses specialist models in supervised object detection and object referring, improving AP by 2.3%. When VISO is sparsified to a comparable AP, FLOPs can be reduced by up to a further 8.5x. Real-world tests show that VISO achieves a 2.8-4.8x FPS speed-up on satellites' embedded GPUs.
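The idea of skipping suppressed regions can be illustrated with a toy sketch. It is not VISO's implementation: the tile grid, the 0.5 threshold, and the names `select_active_tiles`, `sparse_apply`, and `ideal_flops_reduction` are all hypothetical, and the reduction figure assumes cost is exactly proportional to the number of processed tiles, which real hardware only approximates.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]


def select_active_tiles(mask: Grid, threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (row, col) of every tile whose mask score reaches the threshold."""
    return [
        (r, c)
        for r, row in enumerate(mask)
        for c, score in enumerate(row)
        if score >= threshold
    ]


def sparse_apply(
    features: Grid,
    mask: Grid,
    fn: Callable[[float], float],
    threshold: float = 0.5,
    fill: float = 0.0,
) -> Tuple[Grid, int]:
    """Run fn only on active tiles; suppressed tiles keep the fill value.

    Returns the output grid and the number of tiles actually processed.
    """
    out = [[fill for _ in row] for row in features]
    active = select_active_tiles(mask, threshold)
    for r, c in active:
        out[r][c] = fn(features[r][c])
    return out, len(active)


def ideal_flops_reduction(num_active: int, num_total: int) -> float:
    """Idealized speed-up if cost scaled linearly with processed tiles."""
    if num_active == 0:
        return float("inf")
    return num_total / num_active


# Hypothetical 4x4 mask: only two tiles are likely to contain objects.
mask = [
    [0.1, 0.0, 0.2, 0.1],
    [0.0, 0.9, 0.8, 0.0],
    [0.1, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
features = [[1.0] * 4 for _ in range(4)]

# Stand-in for an expensive per-tile computation.
out, n_active = sparse_apply(features, mask, lambda x: 2.0 * x + 1.0)
reduction = ideal_flops_reduction(n_active, 16)
```

In this toy case 2 of 16 tiles are processed, so the idealized reduction is 8x. The figures reported in the abstract are measured on the actual model and are not derived from a calculation like this.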

Subject: ICCV.2025 - Poster