SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model

Authors: Guankun Wang, Wenjin Mo, Junyi Wang, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, Hongliang Ren

Recent advances in Multimodal Large Language Models have demonstrated great potential in the medical domain, helping users understand surgical scenes and procedures. Beyond image-based methods, Video Large Language Models (Vid-LLMs) have emerged as a promising avenue for capturing the complex sequences of information involved in surgery. However, there is still a lack of Vid-LLMs specialized for fine-grained surgical video understanding tasks, which is crucial for analyzing specific processes or details within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train SurgVidLM, we construct the SVU-31K dataset, which consists of over 31K video-instruction pairs and enables both holistic understanding and detailed analysis of surgical procedures. Furthermore, we introduce the StageFocus mechanism, a two-stage framework that performs multi-grained, progressive understanding of surgical videos. We also develop Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the retention of critical information. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing complex procedural contexts.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Published: 2025-06-22 02:16:18 UTC

2506.17873
