
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Authors: Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu

Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy when using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.3% accuracy for a 7B VL-GenRM). We believe VL-RewardBench, along with the experimental insights, will become a valuable resource for advancing VL-GenRMs.
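For readers unfamiliar with the Best-of-N setup the abstract references, the following is a minimal sketch of how a generative reward model is typically used to select among sampled responses. The function names `generate_candidates` and `genrm_score` are hypothetical placeholders, not the paper's actual pipeline: in practice each would wrap a call to a vision-language model to sample N responses and to a VL-GenRM to score each (image, query, response) triple.

```python
# Sketch of Best-of-N sampling with a generative reward model (VL-GenRM) as judge.
# All names here are illustrative assumptions, not APIs from the paper.
from typing import Callable, List

def best_of_n(
    query: str,
    image: object,
    generate_candidates: Callable[[str, object, int], List[str]],
    genrm_score: Callable[[str, object, str], float],
    n: int = 8,
) -> str:
    """Sample n candidate responses and return the one the reward model prefers."""
    candidates = generate_candidates(query, image, n)
    # The VL-GenRM assigns a scalar judgment to each candidate; higher is better.
    return max(candidates, key=lambda resp: genrm_score(query, image, resp))

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; real use would call model APIs.
    gen = lambda q, img, n: [f"candidate answer {i}" for i in range(n)]
    score = lambda q, img, resp: float(len(resp))  # dummy scorer for illustration
    print(best_of_n("What is in the image?", None, gen, score, n=4))
```

Under this setup, downstream task accuracy (e.g., on MMMU-Pro) improves with N only insofar as the reward model ranks correct responses above incorrect ones, which is why the paper's reported r > 0.9 correlation makes VL-RewardBench a useful proxy for VL-GenRM quality.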

Subject: CVPR.2025 - Highlight