CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework


Authors: Sneha Rao, Shaina Raza, Dhanesh Ramachandram

Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).
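The closed-loop control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name below (`Step`, `run_closed_loop`, `max_rounds`, the toy knowledge base `KB`, and the five toy stage functions) is a hypothetical stand-in, and the word-overlap citation rule is an assumption chosen only to keep the example runnable. It shows just the ordering of the five stages and how ungrounded claims reported by the Verifier are passed back to the Extractor for another retrieval round.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One reasoning step plus the evidence IDs cited for it (hypothetical type)."""
    text: str
    citations: List[str] = field(default_factory=list)


def run_closed_loop(question, extract, retrieve, solve, inject_citations, verify,
                    max_rounds=3):
    """Run Extractor -> Retriever -> Solver -> Citation Injector -> Verifier.

    Verifier feedback (a list of ungrounded claims) is handed to the Extractor
    on the next round. `max_rounds` is an assumed safety cap, not from the paper.
    """
    feedback, steps = [], []
    for round_idx in range(1, max_rounds + 1):
        queries = extract(question, feedback)
        evidence = retrieve(queries)
        steps = inject_citations(solve(question, evidence), evidence)
        feedback = verify(steps, evidence)
        if not feedback:  # every step is grounded, so the loop can stop
            return steps, round_idx
    return steps, max_rounds


# --- Toy stages, for illustration only (all assumptions) ---
KB = {
    "leaf color": ("E1", "the leaf in the image is green"),
    "chlorophyll": ("E2", "chlorophyll makes leaves green"),
}


def extract(question, feedback):
    # First pass issues one query; any feedback triggers a targeted extra query.
    return ["leaf color"] if not feedback else ["leaf color", "chlorophyll"]


def retrieve(queries):
    return [KB[q] for q in queries if q in KB]


def solve(question, evidence):
    # A real Solver would call a VLM; here the reasoning steps are fixed.
    return [Step("the leaf is green"), Step("chlorophyll makes leaves green")]


def inject_citations(steps, evidence):
    # Cite an evidence item when it contains every word of the step.
    for step in steps:
        words = set(step.text.split())
        step.citations = [eid for eid, text in evidence if words <= set(text.split())]
    return steps


def verify(steps, evidence):
    # Report the text of each step that ended up with no citation.
    return [s.text for s in steps if not s.citations]


steps, rounds = run_closed_loop(
    "Why is the leaf green?", extract, retrieve, solve, inject_citations, verify
)
```

In this toy run the first round leaves the second step uncited, so the Verifier's feedback causes the Extractor to add a query and the second round grounds both steps. Passing the stages in as plain functions mirrors the modular design the abstract describes, since any stage can be swapped without touching the loop.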

Subject: Artificial Intelligence

Published: 2026-06-16 18:28:47 UTC

2606.18385
