Processing math: 100%

2506.19651

Total: 1

#1 PEVLM: Parallel Encoding for Vision-Language Models [PDF14] [Copy] [Kimi5] [REL]

Authors: Letian Kang, Shixian Luo, Yiqiang Li, Xiaoyang Yu, Shenxuan Zhou, Yong Wu

Vision-Language Models (VLMs) have demonstrated strong performance in video-language tasks, yet their application to long video understanding remains constrained by the quadratic complexity of standard attention mechanisms. In this paper, we propose \textbf{PEVLM}, a parallel encoding strategy specifically designed to improve the prefill efficiency of VLMs without requiring model finetuning. PEVLM partitions the input into block-wise segments with a shared sink, preserves full-attention positional embeddings, and aligns attention weights to mimic full-attention distributions. This design reduces attention computation from O((T×N)2) to O(T×N) while maintaining high accuracy. Extensive experiments on the LongVideoBench benchmark show that PEVLM achieves up to 8.37\% accuracy improvement over existing inference-efficient methods and delivers up to 7.47x speedup in attention computation and 40\% reduction in end-to-end latency. Under strict latency constraints, PEVLM significantly outperforms baselines, raising accuracy from 23.26\% to 61.03\%. These results highlight PEVLM's effectiveness for low-latency, long-context video understanding, making it well-suited for real-world applications such as autonomous driving.

Subjects: Computer Vision and Pattern Recognition , Machine Learning , Performance

Publish: 2025-06-24 14:14:52 UTC