Liu_Commonsense_Video_Question_Answering_through_Video-Grounded_Entailment_Tree_Reasoning@CVPR2025@CVF

Total: 1

#1 Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

Authors: Huabin Liu, Filip Ilievski, Cees G. M. Snoek

This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns, reinforced by their black-box nature and persistent benchmark biases, that they learn spurious correlations between videos and likely answers. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video- and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method's components across benchmarks, VLMs, and reasoning types.
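The abstract does not spell out the four-step pipeline, but the core idea of verifying an answer hypothesis through an entailment tree can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Node` structure, the weakest-link (minimum) aggregation rule, and the `toy_verifier` standing in for a video-language entailment model are all assumptions for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A statement in the entailment tree; leaves are checked against the video."""
    statement: str
    children: List["Node"] = field(default_factory=list)


def verify_tree(node: Node, verifier: Callable[[str], float]) -> float:
    """Score a hypothesis: leaves are verified directly, internal nodes
    take the minimum of their children (a weakest-link aggregation,
    chosen here for illustration)."""
    if not node.children:
        return verifier(node.statement)
    return min(verify_tree(child, verifier) for child in node.children)


def toy_verifier(statement: str) -> float:
    """Hypothetical stand-in for video-language entailment verification:
    returns a confidence that the video supports the statement."""
    evidence = {
        "a person holds a cup": 0.9,
        "the cup is tilted": 0.8,
        "liquid is visible": 0.3,
    }
    return evidence.get(statement, 0.0)


# Candidate answer decomposed into premises that must hold in the video.
tree = Node("the person is drinking", [
    Node("a person holds a cup"),
    Node("the cup is tilted"),
    Node("liquid is visible"),
])

score = verify_tree(tree, toy_verifier)
print(score)  # the weakest premise bounds the hypothesis score: 0.3
```

In this sketch, a low-scoring premise could trigger the paper's dynamic tree expansion step, decomposing that premise further before re-verifying; that control loop is omitted here.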

Subject: CVPR.2025 - Poster