Sun_Structured_Policy_Optimization_Enhance_Large_Vision-Language_Model_via_Self-referenced_Dialogue@ICCV2025@CVF

Structured Policy Optimization: Enhance Large Vision-Language Model via Self-referenced Dialogue

Authors: Guohao Sun, Can Qin, Yihao Feng, Zeyuan Chen, Ran Xu, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao

Preference optimization algorithms typically enhance LLM response quality by leveraging human feedback on multiple answers to a fixed instruction. However, these methods often fail to capture the dynamic nature of conversational exchanges. For large vision-language models (LVLMs), direct preference optimization (DPO) can over-emphasize linguistic nuances while overlooking visual context. To address this challenge, we introduce structured policy optimization (SPO) -- a novel preference optimization method that simultaneously aligns preference instructions, responses, and dialogue interactions to improve multi-modal understanding and reasoning capabilities. The efficacy of SPO is attributed to one key design: treating questioning and answering as a sequential action and binding them through a trajectory reward. This reward formulation better aligns with real-world dialogue studies and eliminates the need for fixed instructions. We evaluate our models on interleaved benchmarks, including image, multi-image, and video-based understanding and reasoning tasks. Experimental results show that fine-tuning LVLMs with SPO on multi-modal preference data aligns them with human preferences more efficiently than DPO.
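
The abstract does not give the objective itself, but the key design it names (scoring the question and the answer as one sequential action bound by a trajectory reward) can be illustrated with a minimal sketch. The code below assumes a DPO-style implicit reward computed over the summed log-probabilities of an entire dialogue trajectory; the function names, the `beta` temperature, and the per-step log-prob inputs are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def trajectory_logprob(step_logps):
    """Sum token log-probs over all dialogue steps (e.g., question then answer),
    scoring the whole exchange as one sequential action."""
    return torch.stack([lp.sum() for lp in step_logps]).sum()

def trajectory_preference_loss(policy_chosen, policy_rejected,
                               ref_chosen, ref_rejected, beta=0.1):
    """Hypothetical DPO-style Bradley-Terry loss applied at the trajectory level.

    Each argument is a list of per-step token log-prob tensors for one dialogue
    trajectory. The implicit trajectory reward is
    beta * (log pi(trajectory) - log pi_ref(trajectory)).
    """
    pi_margin = trajectory_logprob(policy_chosen) - trajectory_logprob(policy_rejected)
    ref_margin = trajectory_logprob(ref_chosen) - trajectory_logprob(ref_rejected)
    # Prefer the chosen trajectory by a larger reward margin than under the reference model.
    return -F.logsigmoid(beta * (pi_margin - ref_margin))

# Toy usage with random per-step log-probs: [question tokens, answer tokens].
chosen = [torch.randn(6), torch.randn(20)]
rejected = [torch.randn(6), torch.randn(20)]
loss = trajectory_preference_loss(chosen, rejected,
                                  [t.detach() for t in chosen],
                                  [t.detach() for t in rejected])
```

Because the question and answer log-probs enter a single trajectory score, the preference signal in this sketch does not presume a fixed instruction, which is the property the abstract attributes to the trajectory reward.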

Subject: ICCV.2025 - Poster