Pay More Attention to Images: Numerous Images-Oriented Multimodal Summarization

Authors: Min Xiao, Junnan Zhu, Feifei Zhai, Chengqing Zong, Yu Zhou

Existing multimodal summarization approaches struggle when the input contains numerous images, which leaves readers with a heavy load. Summarizing both the input text and the numerous images helps readers quickly grasp the key points of the multimodal input. This paper introduces a novel task, Numerous Images-Oriented Multimodal Summarization (NIMMS). To benchmark this task, we first construct a dataset based on a public multimodal summarization dataset. Because most existing metrics evaluate summaries from a unimodal perspective, we propose a new Multimodal Information evaluation (M-info) method, which measures the differences between the generated summary and the multimodal input. Finally, we compare various summarization methods on NIMMS and analyze the associated challenges. Experimental results show that M-info correlates more closely with human judgments than five widely used metrics, while existing models struggle to summarize numerous images. We hope that this research will shed light on the development of multimodal summarization. Our code and dataset will be released to the public.

Subject: NAACL.2025 - Long Papers

2025.naacl-long.474@ACL
