VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

Authors: Diogo Glória-Silva, David Semedo, João Magalhães

We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work, which focuses mainly on text-only guidance or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. We run experiments on a novel dataset of rich Instructional Video Dialogues aligned with cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

Subjects: Computer Vision and Pattern Recognition, Computation and Language

Publish: 2026-02-22 12:20:28 UTC

2602.19146
