ECCV 2024

Total: 1

#1 Textual Grounding for Open-vocabulary Visual Information Extraction in Layout-diversified Documents

Authors: Mengjun Cheng, Chengquan Zhang, Chang Liu, Yuke Li, Bohan Li, Kun Yao, Xiawu Zheng, Rongrong Ji, Jie Chen

Current methodologies have achieved notable success on the closed-set visual information extraction task, while exploration of the open-vocabulary setting, which is practical for individual users who need to infer information across documents of diverse types, remains comparatively underdeveloped. Existing proposed solutions, including NER-based methods and LLM-based methods, fall short in handling the unlimited range of open-vocabulary keys and lack explicit layout modeling. This paper introduces a novel method that tackles this challenge by transforming the task of categorizing text tokens into one of locating regions given queries, also called textual grounding. In particular, we take this a step further by pairing open-vocabulary key language embeddings with the corresponding grounded text visual embeddings. We design a document-tailored grounding framework that incorporates layout-aware context learning and document-tailored two-stage pre-training, which significantly improves the model's understanding of documents. Our method outperforms existing solutions on the SVRD benchmark for the open-vocabulary VIE task while offering lower cost and faster inference. Specifically, our method infers 20x faster than the QwenVL model and achieves an improvement of about 24.3% in F-score.
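
The abstract recasts per-token classification as query-based region grounding: an open-vocabulary key's language embedding is matched against layout-aware embeddings of document text regions. The following is a minimal, hypothetical sketch of that matching step only; the function name ground_key_to_regions, the cosine-similarity scoring, the 256-dimensional embeddings, and the 0.3 threshold are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: ground an open-vocabulary key query to document regions
# by scoring its language embedding against region embeddings. Encoders that
# would produce these embeddings (a text encoder for the key, a layout-aware
# document encoder for the regions) are assumed and stubbed with random tensors.
import torch
import torch.nn.functional as F


def ground_key_to_regions(key_embedding: torch.Tensor,
                          region_embeddings: torch.Tensor,
                          threshold: float = 0.3):
    """Return indices of regions whose embeddings match the key query.

    key_embedding:     (d,)   language embedding of a key, e.g. "invoice date"
    region_embeddings: (n, d) embeddings of n OCR'd text regions
    """
    key = F.normalize(key_embedding, dim=-1)
    regions = F.normalize(region_embeddings, dim=-1)
    scores = regions @ key                              # cosine similarity, shape (n,)
    matched = (scores > threshold).nonzero(as_tuple=True)[0]
    return matched, scores


if __name__ == "__main__":
    torch.manual_seed(0)
    key = torch.randn(256)           # stand-in for a text-encoder output
    regions = torch.randn(12, 256)   # stand-in for 12 region embeddings
    idx, scores = ground_key_to_regions(key, regions)
    print("grounded region indices:", idx.tolist())
```

Because the key is handled as a free-form query embedding rather than a class index, the same scoring applies to keys never seen in training, which is the open-vocabulary property the abstract emphasizes.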

Subject: ECCV.2024 - Poster