Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents

Authors: Ádám Kovács, Bowei He, Xue Liu, István Boros, Szilveszter Tóth, Gábor Recski

Hallucination detection for retrieval-augmented generation (RAG) is usually evaluated on natural-language document evidence. However, grounded generation systems increasingly rely on structured inputs: source code, developer-tool output, markdown documents, tables, and repository metadata. We introduce a unified benchmark for span-level hallucination detection over code, tool output, structured documents, and existing natural-language RAG datasets. The benchmark is built by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split with evidence-based review. Our fine-tuned Qwen3.5-2B detector reaches 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, where it substantially outperforms LettuceDetect-large (0.17) and the strongest zero-shot LLM judges we evaluated (at most 0.22). The same model remains competitive on established natural-language benchmarks, with 81.8 RAGTruth example-F1 and 0.724 English PsiloQA IoU.
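To make the construction step concrete, here is a minimal sketch of injecting one localized hallucination into a grounded answer and recording its exact character offsets, followed by a character-overlap F1 between gold and predicted spans. The function names (`inject_span`, `char_span_f1`), the half-open offset convention, and the character-overlap scoring are assumptions for illustration; the abstract does not specify the benchmark's injection procedure or its exact span-F1 definition.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets (assumed convention)


def inject_span(answer: str, original: str, replacement: str) -> Tuple[str, List[Span]]:
    """Swap the first occurrence of `original` for `replacement` (hypothetical helper).

    Returns the corrupted answer and the character span covering the injected text.
    """
    start = answer.find(original)
    if start < 0:
        raise ValueError("original text not found in answer")
    corrupted = answer[:start] + replacement + answer[start + len(original):]
    return corrupted, [(start, start + len(replacement))]


def _to_chars(spans: List[Span]) -> Set[int]:
    """Expand spans into the set of character positions they cover."""
    return {i for s, e in spans for i in range(s, e)}


def char_span_f1(gold: List[Span], pred: List[Span]) -> float:
    """Character-overlap F1 between gold and predicted spans (one possible span-F1)."""
    g, p = _to_chars(gold), _to_chars(pred)
    if not g and not p:
        return 1.0  # nothing to find and nothing predicted
    overlap = len(g & p)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only; not taken from the benchmark.
answer = "The function retries 3 times before raising TimeoutError."
corrupted, gold = inject_span(answer, "3 times", "5 times")
score = char_span_f1(gold, [(21, 22)])  # detector flags only the wrong digit
```

Because the label is derived from the edit itself, the gold span is exact by construction; a detector that flags only part of the injected text is credited for the overlap rather than scored as a full miss.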

Subject: Computation and Language

Publish: 2026-07-01 13:01:42 UTC

2607.00895
