From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge

Authors: Agnese Taluzzi, Davide Gesualdi, Riccardo Santambrogio, Chiara Plizzari, Francesca Palermo, Simone Mentasti, Matteo Matteucci

This report presents SceneNet and KnowledgeNet, our approaches developed for the HD-EPIC VQA Challenge 2025. SceneNet leverages scene graphs generated with a multi-modal large language model (MLLM) to capture fine-grained object interactions, spatial relationships, and temporally grounded events. In parallel, KnowledgeNet incorporates ConceptNet's external commonsense knowledge to introduce high-level semantic connections between entities, enabling reasoning beyond directly observable visual evidence. Each method demonstrates distinct strengths across the seven categories of the HD-EPIC benchmark, and their combination within our framework results in an overall accuracy of 44.21% on the challenge, highlighting its effectiveness for complex egocentric VQA tasks.

Subject: Computer Vision and Pattern Recognition

Published: 2025-06-10 08:21:38 UTC

arXiv: 2506.08553
