How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects

Authors: Abhivansh Gupta, Simardeep Singh, Advika Sinha, Shreyansh Modi, Akshat Tomar

Visual Language Models (VLMs) are known to produce hallucinated predictions that are not grounded in visual evidence, yet existing approaches lack a principled understanding of how robust such predictions are under counterfactual perturbations. In this work, we study the sample complexity of counterfactual robustness for hallucinated outputs in VLMs. We define a causal influence metric based on log-probability differences between factual, counterfactual, and activation-patched runs, and use it to characterize the stability of hallucinated predictions. By leveraging circuit discovery techniques (CD-T), we identify model components responsible for these predictions and track their activation differences across counterfactual samples. We then derive empirical bounds on the minimum number of counterfactual samples m required to reliably detect instability in hallucinated outputs, using concentration inequalities and variance estimates of the causal influence distribution.
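
The abstract does not give the metric or the bounds explicitly, so the following is only a minimal sketch of the kind of calculation it describes, not the authors' method. It assumes a hypothetical normalized causal influence score (the fraction of the factual-vs-counterfactual log-probability gap recovered by activation patching) and uses two textbook concentration results, Hoeffding's inequality and Chebyshev's inequality, to estimate how many counterfactual samples m are needed for the sample mean of that score to lie within eps of its expectation with probability at least 1 - delta. All function and parameter names are illustrative assumptions.

```python
import math


def causal_influence(lp_factual, lp_counterfactual, lp_patched):
    """Hypothetical normalized influence score (an assumption, not the paper's
    definition): fraction of the factual-vs-counterfactual log-probability gap
    that is recovered when activations are patched into the counterfactual run.
    """
    gap = lp_factual - lp_counterfactual
    if gap == 0:
        raise ValueError("factual and counterfactual log-probs are identical")
    return (lp_patched - lp_counterfactual) / gap


def hoeffding_samples(eps, delta, value_range):
    """Samples needed so that P(|mean - mu| >= eps) <= delta for scores
    bounded in an interval of width `value_range` (Hoeffding's inequality):
    m >= value_range^2 * ln(2 / delta) / (2 * eps^2).
    """
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * eps ** 2))


def chebyshev_samples(eps, delta, variance):
    """Variance-based alternative (Chebyshev's inequality):
    m >= variance / (delta * eps^2). `variance` would in practice be an
    empirical estimate of the influence-score variance.
    """
    return math.ceil(variance / (delta * eps ** 2))


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the paper.
    score = causal_influence(lp_factual=-1.0, lp_counterfactual=-3.0, lp_patched=-2.0)
    print("influence score:", score)
    print("Hoeffding m:", hoeffding_samples(eps=0.1, delta=0.05, value_range=1.0))
    print("Chebyshev m:", chebyshev_samples(eps=0.5, delta=0.25, variance=1.0))
```

The Hoeffding bound needs only a known range for the score, whereas the Chebyshev bound uses a variance estimate, so it gives the smaller m when the influence distribution is tightly concentrated and delta is not very small.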

Subjects: Machine Learning, Artificial Intelligence

Publish: 2026-06-07 18:45:29 UTC

2606.08777
