CORVUS: Red-Teaming Hallucination Detectors via Internal Signal Camouflage in Large Language Models

Authors: Nay Myat Min, Long H. Pham, Hongyu Zhang, Jun Sun

Single-pass hallucination detectors rely on internal telemetry (e.g., uncertainty, hidden-state geometry, and attention) of large language models, implicitly assuming hallucinations leave separable traces in these signals. We study a white-box, model-side adversary that fine-tunes lightweight LoRA adapters on the model while keeping the detector fixed, and introduce CORVUS, an efficient red-teaming procedure that learns to camouflage detector-visible telemetry under teacher forcing, including an embedding-space FGSM attention stress test. Trained on 1,000 out-of-distribution Alpaca instructions (<0.5% trainable parameters), CORVUS transfers to FAVA-Annotation across Llama-2, Vicuna, Llama-3, and Qwen2.5, and degrades both training-free detectors (e.g., LLM-Check) and probe-based detectors (e.g., SEP, ICR-probe), motivating adversary-aware auditing that incorporates external grounding or cross-model evidence.
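To make the FGSM idea mentioned in the abstract concrete, here is a minimal sketch of a single generic FGSM step applied to an embedding vector. It is not the paper's procedure: the linear sigmoid probe `detector_score`, its `weights` and `bias`, and the toy embedding are all hypothetical stand-ins chosen so the gradient can be written analytically. The step moves each coordinate by at most `epsilon` in the direction that lowers the probe's score.

```python
import math

def detector_score(embedding, weights, bias):
    """Hypothetical linear probe: sigmoid(w . e + b), read as P(hallucination)."""
    logit = sum(w * e for w, e in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def score_gradient(embedding, weights, bias):
    """Analytic gradient of the sigmoid score with respect to the embedding."""
    s = detector_score(embedding, weights, bias)
    return [s * (1.0 - s) * w for w in weights]

def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def fgsm_step(embedding, weights, bias, epsilon):
    """One FGSM step that lowers the score: e' = e - epsilon * sign(grad)."""
    grad = score_gradient(embedding, weights, bias)
    return [e - epsilon * sign(g) for e, g in zip(embedding, grad)]

# Toy values, assumed purely for illustration.
weights = [0.8, -0.5, 0.3]
bias = 0.1
embedding = [1.0, 0.2, -0.4]
epsilon = 0.1

perturbed = fgsm_step(embedding, weights, bias, epsilon)
before = detector_score(embedding, weights, bias)
after = detector_score(perturbed, weights, bias)
print(after < before)
```

For a linear probe the step lowers the logit by exactly `epsilon` times the sum of the absolute weights, so the score always drops. A real detector reading hidden states or attention would need the gradient from automatic differentiation instead of the closed form used here.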

Subjects: Cryptography and Security, Artificial Intelligence

Publish: 2026-01-19 08:07:03 UTC

2601.14310
