DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding

#1 DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding [PDF⁴] [Copy] [Kimi²] [REL]

Authors: Tanveer Hannan, Dimitrios Mallios, Parth Pathak, Faegheh Sardari, Thomas Seidl, Gedas Bertasius, Mohsen Fayyaz, Sunando Sengupta

Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82\% fewer visual tokens, 75\% fewer parameters, and 71\% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.

Subject: Computer Vision and Pattern Recognition

Publish: 2025-11-14 13:56:39 UTC

2511.11313

#1 DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding [PDF4] [Copy] [Kimi2] [REL]

#1 DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding [PDF⁴] [Copy] [Kimi²] [REL]