HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection

Authors: Deanna Emery, Michael Goitia, Freddie Vargus, Iulia Neagu

As large language models (LLMs) are increasingly deployed in high-stakes domains, detecting hallucinated content (text that is not grounded in supporting evidence) has become a critical challenge. Existing benchmarks for hallucination detection are often synthetically generated, narrowly focused on extractive question answering, and fail to capture the complexity of real-world scenarios involving multi-document contexts and full-sentence outputs. We introduce the HalluMix Benchmark, a diverse, task-agnostic dataset that includes examples from a range of domains and formats. Using this benchmark, we evaluate seven hallucination detection systems, both open and closed source, highlighting differences in performance across tasks, document lengths, and input representations. Our analysis highlights substantial performance disparities between short and long contexts, with critical implications for real-world Retrieval Augmented Generation (RAG) implementations. Quotient Detections achieves the best overall performance, with an accuracy of 0.82 and an F1 score of 0.84.
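
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how accuracy and F1 are conventionally computed for a binary hallucination detector. It is not taken from the paper: the function name `accuracy_and_f1`, the choice of label 1 ("hallucinated") as the positive class, and the toy labels are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1), treating label 1 as the positive class.

    Assumption: 1 = hallucinated, 0 = faithful. The paper's own label
    convention is not stated in the abstract.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1


# Toy labels, invented for illustration only (not benchmark data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

Accuracy counts every correct decision equally, while F1 ignores true negatives. The two can therefore diverge when the classes are imbalanced, which is one reason benchmarks commonly report both.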

Subjects: Computation and Language, Artificial Intelligence

Published: 2025-05-01 13:22:45 UTC

2505.00506
