Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025

Authors: Sujata Gaihre, Amir Thapa Magar, Prasuna Pokharel, Laxmi Tiwari

This paper describes our approach to Subtask 1 of the ImageCLEFmed MEDVQA 2025 Challenge, which targets visual question answering (VQA) for gastrointestinal endoscopy. We adopt Florence, a large-scale multimodal foundation model, as the backbone of our VQA pipeline, pairing a powerful vision encoder with a text encoder to interpret endoscopic images and produce clinically relevant answers. To improve generalization, we apply domain-specific augmentations that preserve medical features while increasing training diversity. Experiments on the Kvasir dataset show that fine-tuning Florence yields accurate responses on the official challenge metrics. Our results highlight the potential of large multimodal models in medical VQA and provide a strong baseline for future work on explainability, robustness, and clinical integration. The code is publicly available at: https://github.com/TiwariLaxuu/VQA-Florence.git

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Published: 2025-07-19 09:04:13 UTC

arXiv: 2507.14544