Reconstruction as a Bridge for Event-Based Visual Question Answering

Authors: Hanyue Lou, Jiayi Zhou, Yang Zhang, Boyu Li, Yi Wang, Guangnan Ye, Boxin Shi

Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.

Subject: Computer Vision and Pattern Recognition

Published: 2025-12-12 12:16:45 UTC

arXiv: 2512.11510
