We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls, validating that correct answers cannot be inferred solely from textual cues or answer shortcuts. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. Our evaluation of 20 frontier models highlights a significant performance gap between the leading models and human experts. Through comprehensive error analysis, case studies, and an exploration of retrieval-augmented generation methods, we offer actionable insights to guide future advancements.