2506.17608

HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs

Authors: Nikitha SR, Aradhya Neeraj Mathur, Tarun Ram Menta, Rishabh Jain, Mausoom Sarkar

The integration of high-resolution image features in modern multimodal large language models has demonstrated significant improvements in fine-grained visual understanding tasks, achieving high performance across multiple benchmarks. Since these features are obtained from large image encoders like ViT, they come with a significant increase in computational cost due to multiple calls to these encoders. In this work, we first develop an intuition for feature upsampling as a natural extension of high-resolution feature generation. Through extensive experiments and ablations, we demonstrate how a shallow feature enricher can achieve competitive results with tremendous reductions in training and inference time as well as computational cost, with up to 1.5x savings in FLOPs.
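
To make the idea of feature upsampling concrete, here is a minimal sketch. It is not the HIRE enricher, whose architecture the abstract does not specify. It uses plain nearest-neighbour upsampling of a patch-feature grid as a stand-in, and it assumes a common tiling scheme (a k x k grid of crops plus one global view) to illustrate why high-resolution pipelines need multiple encoder calls. The names `upsample_features` and `encoder_calls_tiled` are hypothetical.

```python
import numpy as np


def upsample_features(feats: np.ndarray, scale: int = 2) -> np.ndarray:
    """Nearest-neighbour upsampling of an (H, W, C) patch-feature grid.

    Stand-in for a learned enricher: each low-resolution feature vector is
    simply copied into a scale x scale block of the output grid.
    """
    return feats.repeat(scale, axis=0).repeat(scale, axis=1)


def encoder_calls_tiled(tiles_per_side: int) -> int:
    """Encoder calls under an ASSUMED tiling scheme: k x k crops + 1 global view."""
    return tiles_per_side ** 2 + 1


# A toy 2 x 2 grid of 3-dimensional patch features from one encoder call.
low_res = np.arange(2 * 2 * 3, dtype=np.float32).reshape(2, 2, 3)
high_res = upsample_features(low_res, scale=2)

print(high_res.shape)          # grid is now 4 x 4 with the same channel count
print(encoder_calls_tiled(2))  # calls a 2 x 2 tiled pipeline would need
```

In this toy setup, the upsampling path reaches a 4 x 4 feature grid from a single encoder call, whereas the assumed tiled pipeline would call the encoder once per crop plus once for the global view. A learned enricher would replace the copy operation with trainable layers.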

Subject: Computer Vision and Pattern Recognition

Published: 2025-06-21 06:13:56 UTC