The Progress Illusion: Revisiting meta-evaluation standards of LLM evaluators

Authors: Tianruo Rose Xu, Vedant Gaur, Liu Leqi, Tanya Goyal

LLM judges have gained popularity as an inexpensive and performant substitute for human evaluation. However, we observe that the meta-evaluation setting in which the reliability of these LLM evaluators is established is substantially different from their use in model development. To address this, we revisit meta-evaluations of LLM evaluators under a setting that more closely aligns with practice by examining evaluators’ ability to distinguish test system pairs that are closer in capability. Our fine-grained approach shows that all LLM evaluators’ correlations with human judgments are concerningly low when the models perform similarly, showcasing a key limitation of current norms. Equipped with this better methodology, we next analyze the impact that the choice of reference model has on LLM-as-a-judge evaluator performance. We show that single-reference evaluators only perform well at ranking test systems that fall within particular capability ranges, even if the standard meta-evaluation reports high overall correlation. Taken together, our analysis shows critical issues with current LLM meta-evaluation and recommends avenues for improvement.
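
The contrast the abstract draws, between agreement over all system pairs and agreement over pairs that are close in capability, can be illustrated with a minimal sketch. This is not the authors' metric or data: the paper reports correlations with human judgments, while the sketch below substitutes a simple pairwise-ordering agreement rate, and the function name `pairwise_agreement`, the `max_gap` threshold, and all scores are hypothetical values chosen for illustration.

```python
from itertools import combinations


def pairwise_agreement(human, judge, max_gap=None):
    """Fraction of system pairs that human and judge scores order the same way.

    If max_gap is given, only pairs whose human scores differ by at most
    max_gap are counted (a stand-in for "close in capability").
    Pairs tied under human scores are skipped.
    """
    agree = 0
    total = 0
    for a, b in combinations(sorted(human), 2):
        h_diff = human[a] - human[b]
        j_diff = judge[a] - judge[b]
        if h_diff == 0:
            continue
        if max_gap is not None and abs(h_diff) > max_gap:
            continue
        total += 1
        if h_diff * j_diff > 0:
            agree += 1
    return agree / total if total else float("nan")


# Made-up scores for five hypothetical systems (not from the paper).
human = {"A": 0.90, "B": 0.70, "C": 0.68, "D": 0.40, "E": 0.38}
judge = {"A": 0.95, "B": 0.66, "C": 0.71, "D": 0.35, "E": 0.37}

overall = pairwise_agreement(human, judge)             # all 10 pairs
close = pairwise_agreement(human, judge, max_gap=0.05)  # only B-C and D-E

print(overall)  # 0.8
print(close)    # 0.0
```

In this toy data the judge orders 8 of 10 pairs correctly overall, yet gets both close pairs wrong, which is the kind of gap an aggregate meta-evaluation score can hide.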

Subject: EMNLP.2025 - Findings

2025.findings-emnlp.1036@ACL
