CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding

Multimodal reasoning with vision-language models (VLMs) often suffers from hallucinations, as models tend to generate explanations after only a superficial inspection of the image. We present CoRGI (Chain of Reasoning with Grounded Insights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs. Given a VLM-generated rationale, CoRGI decomposes it into step-wise statements, grounds each step in visual evidence, and filters or corrects unsupported claims before producing the final answer. Experiments on five challenging benchmarks (VCR, ScienceQA, MMMU, MathVista, and HallusionBench) demonstrate that CoRGI consistently improves both answer accuracy and explanation faithfulness across multiple VLM backbones, including Qwen-2.5VL, LLaVA-1.6, and Gemma3-12B. Beyond quantitative gains, qualitative analyses further illustrate how the verification process reduces hallucination and strengthens interpretability, suggesting that post-hoc visual grounding is a promising direction for building more trustworthy and transparent multimodal reasoning systems.
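The three stages named in the abstract (decompose the rationale, ground each step, filter unsupported claims) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how steps are split or grounded, so `split_rationale`, `verify_rationale`, `keep_supported`, and `toy_grounder` are hypothetical names, the sentence splitter is a naive stand-in, and the grounder is a keyword check against a hard-coded set of "visible" objects rather than a real VLM call. The correction stage is omitted; only filtering is shown.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class VerifiedStep:
    statement: str   # one step of the rationale
    supported: bool  # whether the grounder found visual support
    evidence: str    # short description of the supporting evidence, if any


def split_rationale(rationale: str) -> List[str]:
    """Naively split a free-text rationale into sentence-level steps."""
    parts = re.split(r"(?<=[.!?])\s+", rationale.strip())
    return [p for p in parts if p]


def verify_rationale(
    rationale: str,
    ground: Callable[[str], Tuple[bool, str]],
) -> List[VerifiedStep]:
    """Run a grounding function over every step of the rationale."""
    steps = []
    for statement in split_rationale(rationale):
        supported, evidence = ground(statement)
        steps.append(VerifiedStep(statement, supported, evidence))
    return steps


def keep_supported(steps: List[VerifiedStep]) -> List[str]:
    """Drop steps the grounder could not support."""
    return [s.statement for s in steps if s.supported]


def toy_grounder(statement: str) -> Tuple[bool, str]:
    """Hypothetical stand-in for visual grounding: keyword overlap only."""
    visible = {"dog", "ball"}  # assumption: objects "seen" in the image
    words = set(re.findall(r"[a-z]+", statement.lower()))
    hits = sorted(words & visible)
    return bool(hits), ", ".join(hits)


rationale = "A dog is chasing a ball. The cat is asleep on the sofa."
steps = verify_rationale(rationale, toy_grounder)
filtered = keep_supported(steps)
print(filtered)
```

In a real system the `ground` callable would presumably query a vision model with the image and the statement; keeping it as a parameter lets the filtering logic stay independent of whichever backbone is used.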

Subjects: Artificial Intelligence, Computer Vision and Pattern Recognition

Published: 2025-08-01 07:17:12 UTC

2508.00378
