Multimodal large models have made significant progress, yet fine-grained understanding of complex scenes remains challenging. High-quality, large-scale vision-language datasets are essential for addressing this issue. However, existing methods often rely on labor-intensive manual annotation or on high-performing closed-source models, making large-scale data collection costly. To overcome these limitations, we propose a self-bootstrapped training pipeline that leverages the model's own multimodal capabilities to recursively refine its understanding. By decomposing existing multimodal data into localized sub-regions and generating hierarchical scene descriptions and multi-faceted question-answer pairs, we construct a dataset of 1.4M image-task instances. We then use this dataset to train the base model, significantly enhancing its ability to interpret complex visual scenes and perform a variety of vision-related tasks. Our OURO model, fine-tuned from Qwen2-VL-7B-Instruct with LoRA, achieves substantial improvements over both the base model and similarly sized counterparts across multiple multimodal benchmarks. Our self-bootstrapped training pipeline offers a novel paradigm for the continuous improvement of multimodal models. Code and datasets are available at https://github.com/tinnel123666888/OURO.git.