2508.11196

UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

Authors: Jiajin Guan (Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu, China), Haibo Mei (School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, Chengdu, China), Bonan Zhang (Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu, China), Dan Liu (Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu, China), Yuanshuang Fu (Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu, China), Yue Zhang (School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, Chengdu, China)

Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features high resolution, complex spatial semantics, and strict real-time constraints. These challenges limit the applicability of general-purpose VLMs to structured aerial reasoning tasks. To address these challenges, we propose UAV-VL-R1, a lightweight VLM explicitly designed for aerial visual reasoning. It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm to promote structured and interpretable reasoning through rule-guided rewards and intra-group policy alignment. To support model training and evaluation, we introduce a high-resolution visual question answering dataset named HRVQA-VL, which consists of 50,019 annotated samples covering eight UAV-relevant reasoning tasks, including object counting, transportation recognition, and spatial scene inference. Experimental results show that UAV-VL-R1 achieves a 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant, which is 36x larger, on multiple tasks. Ablation studies reveal that while SFT improves semantic alignment, it may reduce reasoning diversity in mathematical tasks. GRPO-based RL compensates for this limitation by enhancing logical flexibility and the robustness of inference. Additionally, UAV-VL-R1 requires only 3.9GB of memory under FP16 inference and can be quantized to 2.5GB with INT8, supporting real-time deployment on resource-constrained UAV platforms.
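The abstract names GRPO with rule-guided rewards but does not spell out the reward rules, so the following is only a minimal sketch under stated assumptions. The `<think>`/`<answer>` tag format, the 0.5 and 1.0 reward weights, and the function names `rule_reward` and `group_relative_advantages` are hypothetical and not taken from the paper. What the sketch does show is the general GRPO idea of scoring a group of sampled responses to the same question and normalizing each reward against the group mean and standard deviation.

```python
import re
import statistics


def rule_reward(response: str, answer: str) -> float:
    """Hypothetical rule-guided reward (assumed, not the paper's rules):
    0.5 for following a <think>...</think><answer>...</answer> layout,
    plus 1.0 if the extracted answer matches the reference exactly."""
    has_format = re.search(
        r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.DOTALL
    ) is not None
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    predicted = match.group(1).strip().lower() if match else ""
    correct = predicted == answer.strip().lower()
    return 0.5 * has_format + 1.0 * correct


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps).
    No learned value function is needed; the group acts as the baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one counting question whose reference answer is "2".
group = [
    "<think>two cars near the junction</think><answer>2</answer>",
    "<think>three cars</think><answer>3</answer>",
    "<answer>3</answer>",
    "2",
]
rewards = [rule_reward(r, "2") for r in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive positive advantages and are reinforced; those below average receive negative ones. When every response in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.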

Subject: Computer Vision and Pattern Recognition

Published: 2025-08-15 04:06:40 UTC