This paper addresses the limitations of existing cross-view object geo-localization schemes, which rely on rectangular proposals to localize irregular objects in satellite imagery. These "rectangular shackles" inherently struggle to precisely delineate objects with complex geometries, leading to incomplete coverage or erroneous localization. We propose a novel scheme, cross-view object segmentation (CVOS), which achieves fine-grained geo-localization by predicting pixel-level segmentation masks of query objects. CVOS enables accurate extraction of object shapes, sizes, and areas, which are critical for applications such as urban planning and agricultural monitoring. To support and evaluate the new CVOS scheme, we introduce the CVOGL-Seg dataset. To tackle the challenges of CVOS, we propose Transformer Object Geo-localization (TROGeo), a two-stage framework. First, the Heterogeneous Task Training Stage (HTTS) employs a single transformer encoder with a Cross-View Object Perception Module (CVOPM) and is trained on a heterogeneous task. Second, the SAM Prompt Stage (SPS) exploits SAM's zero-shot segmentation capability, guided by the HTTS outputs, to generate precise masks. Extensive experiments on both the CVOGL and CVOGL-Seg datasets demonstrate that our approach achieves state-of-the-art performance, effectively breaking the rectangular shackles and unlocking new possibilities for fine-grained object geo-localization. Our project page: https://zqwlearning.github.io/CVOS.
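To make the two-stage pipeline concrete, the sketch below shows how a second stage in the spirit of SPS could consume first-stage localization outputs as prompts to SAM via the public `segment_anything` API. The checkpoint path, the dummy satellite image, and the exact form of the HTTS outputs (a predicted object-center point and a coarse box) are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM backbone; the checkpoint filename is a placeholder, not from the paper.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Satellite image (H x W x 3, uint8 RGB); replaced here with dummy data.
satellite_image = np.zeros((1024, 1024, 3), dtype=np.uint8)
predictor.set_image(satellite_image)

# Hypothetical stage-1 (HTTS) outputs for the query object:
point_coords = np.array([[512.0, 480.0]])     # (x, y) of a predicted object center
point_labels = np.array([1])                  # 1 marks the point as foreground
box = np.array([420.0, 400.0, 610.0, 570.0])  # coarse box (x1, y1, x2, y2)

# Stage 2: SAM's zero-shot segmentation turns the prompts into a pixel-level mask.
masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    box=box,
    multimask_output=False,
)
object_mask = masks[0]  # boolean H x W mask of the query object
```

From such a mask, shape, size, and area statistics follow directly (e.g., `object_mask.sum()` gives the object's area in pixels), which is what frees the output from rectangular proposals.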