mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding


Authors: Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou

Multimodal Large Language Models (MLLMs) have achieved promising OCR-free document understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory use and slower inference, particularly in multi-page document comprehension. To address these challenges, we propose a High-resolution DocCompressor module that compresses each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension and balance token efficiency with question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state of the art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%. Compared to single-image MLLMs trained on similar data, DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our code, models, and data will be publicly available.
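To make the compression idea concrete, here is a minimal sketch of one way a fixed number of low-resolution global features could "guide" the compression of many high-resolution features: plain cross-attention in which the global features act as queries and the high-resolution features act as keys and values. The abstract only states that compression is guided by low-resolution global visual features and yields 324 tokens per image. The query/key assignment, the absence of learned projections, the function name `compress`, and the sizes `HIGHRES_TOKENS` and `DIM` are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress(global_feats, highres_feats):
    """Cross-attention without learned weights (illustrative assumption).

    global_feats:  Q query vectors of length d (low-resolution global view)
    highres_feats: M key/value vectors of length d (high-resolution view)
    Returns Q vectors of length d, so the output token count is Q, not M.
    """
    d = len(global_feats[0])
    scale = 1.0 / math.sqrt(d)
    compressed = []
    for q in global_feats:
        # Scaled dot-product score of this query against every high-res token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in highres_feats]
        weights = softmax(scores)
        # Weighted average of the high-res tokens for this query.
        compressed.append([
            sum(w * v[j] for w, v in zip(weights, highres_feats))
            for j in range(d)
        ])
    return compressed


NUM_QUERIES = 324      # tokens per image, as stated in the abstract
HIGHRES_TOKENS = 1000  # hypothetical stand-in for "thousands of visual tokens"
DIM = 4                # hypothetical feature width, kept tiny for speed

rng = random.Random(0)
global_feats = [[rng.gauss(0, 1) for _ in range(DIM)]
                for _ in range(NUM_QUERIES)]
highres_feats = [[rng.gauss(0, 1) for _ in range(DIM)]
                 for _ in range(HIGHRES_TOKENS)]

compressed = compress(global_feats, highres_feats)
print(len(compressed), len(compressed[0]))
```

In this sketch the output length is fixed by the number of queries, so the token count per page stays at 324 regardless of how many high-resolution features the encoder produces. That property is what keeps the visual token budget linear in the number of pages with a small constant.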

Subject: ACL.2025 - Long Papers

2025.acl-long.291@ACL
