We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls, validating that correct answers cannot be inferred solely from textual cues or answer shortcuts. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. Our evaluation of 20 frontier models highlights a significant performance gap between the leading models and human experts. Through comprehensive error analysis, case studies, and an exploration of retrieval-augmented generation methods, we offer actionable insights to guide future advancements.