M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

Authors: Zhengjun Huang, Wenxuan Liu, Zhoujin Tian, Wei Chen, Junle Chen, Yuqian Wu, Fangyuan Zhang, Qintian Guo, Xiaofang Zhou

Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human conversational form with sparse visuals and straightforward content, and they evaluate neither reasoning over authentic multimodal file interaction nor the interpretation of concealed user information. We therefore introduce M$^3$Exam, a query-centric multimodal conversational memory benchmark built on realistic user-agent interaction, with multi-dimensional evaluation spanning cross-modal grounding and implicit information inference. Benchmarking MLLMs and memory systems reveals persistent gaps in cross-modal grounding and cross-session reasoning, as well as the efficiency cost of accumulating multimodal context. We further propose M$^3$Proctor, a multimodal memory method that detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.

Subject: Computation and Language

Publish: 2026-06-05 15:44:18 UTC

2606.07402
