2025.findings-emnlp.1363@ACL

Total: 1

#1 Real-World Summarization: When Evaluation Reaches Its Limits

Authors: Patrícia Schmidtová, Ondřej Dušek, Saad Mahamood

We examine the evaluation of faithfulness to input data in the context of hotel highlights—brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics such as word overlap correlate surprisingly well with human judgments (r=0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation, as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows that incorrect and non-checkable information pose the greatest risks. We also highlight challenges in crowdsourced evaluations.
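As a rough illustration of the comparison the abstract describes (a simple word-overlap metric correlated against human faithfulness judgments), here is a minimal sketch. The tokenization, overlap definition, and toy data below are assumptions for illustration only and are not the paper's actual metric or dataset.

```python
# Sketch: score generated hotel highlights by word overlap with the source
# description, then correlate the metric scores with human faithfulness ratings.
# Tokenization, the overlap formula, and all data here are illustrative assumptions.
from collections import Counter
from scipy.stats import pearsonr


def word_overlap(summary: str, source: str) -> float:
    """Fraction of summary tokens that also appear in the source text."""
    summary_tokens = Counter(summary.lower().split())
    source_tokens = Counter(source.lower().split())
    if not summary_tokens:
        return 0.0
    matched = sum(min(count, source_tokens[tok]) for tok, count in summary_tokens.items())
    return matched / sum(summary_tokens.values())


# Hypothetical hotel descriptions, generated highlights, and human ratings (0-1 scale).
sources = [
    "rooftop pool free breakfast and rooms overlooking the old town square",
    "budget hostel with shared kitchen five minutes from the central station",
    "quiet countryside inn with an on-site spa and vegetarian restaurant",
]
highlights = [
    "rooftop pool, free breakfast, views of the old town square",
    "luxury suites with private balconies near the central station",
    "quiet inn with an on-site spa and vegetarian restaurant",
]
human_scores = [0.95, 0.30, 0.90]

metric_scores = [word_overlap(h, s) for h, s in zip(highlights, sources)]
r, p_value = pearsonr(metric_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```

In a real evaluation, the human scores would come from the categorical or span-level annotation campaigns mentioned above, and the correlation would be computed over the full set of annotated highlights rather than a handful of toy examples.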

Subject: EMNLP.2025 - Findings