ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task

Authors: Ahmad Khalil, Mahmoud Khalil, Alioune Ngom

In this paper, we introduce ResNetVLLM (ResNet Vision LLM), a novel cross-modal framework for zero-shot video understanding that integrates a ResNet-based visual encoder with a Large Language Model (LLM). ResNetVLLM addresses the challenges associated with zero-shot video models by avoiding reliance on pre-trained video understanding models and instead employing a non-pretrained ResNet to extract visual features. This design ensures the model learns visual and semantic representations within a unified architecture, enhancing its ability to generate accurate and contextually relevant textual descriptions from video inputs. Our experimental results demonstrate that ResNetVLLM achieves state-of-the-art performance in zero-shot video understanding (ZSVU) on several benchmarks, including MSRVTT-QA, MSVD-QA, TGIF-QA FrameQA, and ActivityNet-QA.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Published: 2025-04-20 00:20:18 UTC

2504.14432
