#1 Evaluating the Factuality of Large Language Models Using Multiple Plug-and-Play Fact Sources

Authors: Zhaoheng Huang, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou

Large language models (LLMs) often produce factually inaccurate content, or hallucinations, which undermines their reliability. Existing factuality evaluation systems usually rely on a single predefined fact source, making them task-specific and hard to extend. We present UFO, a unified framework for factuality evaluation that supports multiple plug-and-play fact sources. UFO integrates human-written evidence, web search results, and LLM knowledge within a single evaluation pipeline, and allows users to flexibly select, reorder, and even define customized sources. The system is accessible through both a Python interface and a web-based demo, offering interactive claim-level verification and visualization. Experiments show that the UFO system achieves moderate consistency with human annotations. Overall, UFO serves as a transparent and extensible platform for benchmarking fact sources, comparing LLMs, and enabling real-world fact-checking applications across diverse domains.

Subject: AAAI.2026 - Demonstration Track