V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis

Authors: Yuan Wang, Jiaxiang Liu, Shujian Gao, Bin Feng, Zhihang Tang, Xiaotang Gai, Jian Wu, Zuozhu Liu

Recent advances in multimodal techniques have led to significant progress in Medical Visual Question Answering (Med-VQA). However, most existing models focus on global image features rather than localizing disease-specific regions crucial for diagnosis. Additionally, current research tends to emphasize answer accuracy at the expense of the reasoning pathway, yet both are crucial for clinical decision-making. To address these challenges, we propose From Vision to Text Chain-of-Thought (V2T-CoT), a novel approach that automates the localization of preference areas within biomedical images and incorporates this localization into region-level pixel attention as knowledge for Vision CoT. By fine-tuning the vision-language model on the constructed R-Med 39K dataset, V2T-CoT provides definitive medical reasoning paths. V2T-CoT integrates visual grounding with textual rationale generation to establish precise and explainable diagnostic results. Experimental results across four Med-VQA benchmarks demonstrate state-of-the-art performance, with substantial improvements in both accuracy and interpretability.

Subject: Computational Engineering, Finance, and Science

Publish: 2025-06-24 13:23:25 UTC

2506.19610
