Chen_RMultiplex200K_Toward_Reliable_Multimodal_Process_Supervision_for_Visual_Language_Models@ICCV2025@CVF

#1 RMultiplex200K: Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications

Authors: Sijia Chen, Bin Song

Visual Language Models (VLMs) have achieved remarkable success in many domains thanks to their ability to perform step-by-step reasoning. However, progress in the telecommunications (Telecom) domain remains limited, primarily due to the lack of high-quality datasets and domain-specific insights. In this paper, we introduce RMultiplex200K, a multimodal dataset that provides step-wise reasoning rationales and correctness scores for real-world Telecom questions. It enables VLMs to perform step-level reasoning and verification with multimodal information, thereby facilitating reliable problem-solving. RMultiplex200K is highly scalable because it is constructed without human annotations, relying instead on our automatic plan-based annotation (ApPA) method, which synthesizes reasoning steps labeled with reward scores. Building on this dataset, we introduce TC-NAVIGATOR, a new mechanism for training multimodal process reward models that serve as reliable reasoning verifiers for VLMs. For instance, the Qwen-2-VL-72B and Llama-3.2-90B models, which initially achieve only 21.3% and 19.8% accuracy on practical Telecom questions, reach 48.5% and 46.1%, respectively, after training with RMultiplex200K and verification with TC-NAVIGATOR.
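As a rough illustration of how a process reward model can act as a step-level verifier, the Python sketch below scores each prefix of a candidate reasoning chain and keeps the chain whose weakest step is strongest. The `StepScorer` interface, the min-aggregation rule, and the toy Telecom steps are assumptions made for this sketch; they are not the paper's TC-NAVIGATOR implementation.

```python
from typing import Callable, List

# Hypothetical stand-in for a trained process reward model: maps a partial
# reasoning prefix (list of steps so far) to a reliability score in [0, 1].
StepScorer = Callable[[List[str]], float]

def score_chain(steps: List[str], scorer: StepScorer) -> float:
    """Score a candidate chain by its weakest step (min over step prefixes),
    one common way to use step-level rewards for verification."""
    return min(scorer(steps[: i + 1]) for i in range(len(steps)))

def select_best(chains: List[List[str]], scorer: StepScorer) -> List[str]:
    """Best-of-N selection: keep the candidate chain the verifier trusts most."""
    return max(chains, key=lambda c: score_chain(c, scorer))

if __name__ == "__main__":
    # Toy scorer for demonstration only: pretends longer, more specific steps
    # are more reliable. A real verifier would be a trained multimodal model.
    toy_scorer: StepScorer = lambda prefix: min(1.0, len(prefix[-1]) / 80.0)
    candidates = [
        ["Estimate the link budget.", "Pick modulation by guessing."],
        ["Compute path loss from the carrier frequency and distance.",
         "Choose the modulation scheme that meets the resulting SNR margin."],
    ]
    print(select_best(candidates, toy_scorer)[0])
```

In this sketch the verifier only reranks complete candidate chains; a step-level verifier can equally be applied during decoding to prune unreliable steps early.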

Subject: ICCV.2025 - Poster