Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput

Authors: Bo Zhang, Shuo Li, Runhe Tian, Yang Yang, Jixin Tang, Jinhao Zhou, Lin Ma

In this paper, we introduce Flash-VL 2B, a novel approach to optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without sacrificing accuracy. Leveraging advanced architectural enhancements and efficient computational strategies, Flash-VL 2B is designed to maximize throughput by reducing processing time while maintaining competitive performance across multiple vision-language benchmarks. Our approach includes tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching that effectively balances computational load and model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising solution for deployment in resource-constrained environments and large-scale real-time applications.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Published: 2025-05-14 15:45:17 UTC

arXiv: 2505.09498
