wTBhWbCRpN@OpenReview


#1 Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration

Authors: Zhitao Zeng, Guojian Yuan, Junyuan Mao, Yuxuan Wang, Xiaoshuang Jia, Yueming Jin

Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing "multi-scale" into two orthogonal dimensions: the temporal scale, which forecasts human and surgical states at varying look-ahead intervals, and the state scale, which models a hierarchy of states in general and surgical scenes. For instance, in general scenes, contact-relationship states are finer-grained than spatial-relationship states. In surgical scenes, medium-level steps are finer-grained than high-level phases yet remain constrained by their encompassing phase. To support this unified task, we introduce the first MSTP Benchmark, featuring synchronized annotations across multiple state scales and temporal scales. We further propose a novel method, Incremental Generation and Multi-agent Collaboration (IG-MC), which integrates two key innovations. First, we propose a plug-and-play incremental generation mechanism that maintains high-quality temporal prediction by continuously synthesizing up-to-date visual previews at expanding temporal scales to inform multiple decision-making agents, keeping decision content and generated visuals synchronized and preventing performance degradation as look-ahead intervals lengthen. Second, we propose a decision-driven multi-agent collaboration framework for multi-state prediction, comprising generation, initiation, and multi-state assessment agents that dynamically trigger and evaluate prediction cycles to balance global coherence and local fidelity. Extensive experiments on the MSTP Benchmark in general and surgical scenes show that IG-MC is a generalizable plug-and-play method for MSTP, demonstrating the effectiveness of incremental generation and the stability of decision-driven multi-agent collaboration.
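The abstract gives no implementation details, so the following is only a toy sketch of the control flow it describes, not the authors' IG-MC method. It assumes that each preview is generated from the previous preview rather than from the original frame (the incremental part), and that a fine-grained step must belong to the predicted phase (the hierarchy constraint). Every name here (`predict_multiscale`, `STEPS_OF_PHASE`, the `toy_*` stand-ins for the agents) is hypothetical, and the "scene" is reduced to an integer time index so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical state hierarchy: each phase constrains which steps are valid.
STEPS_OF_PHASE: Dict[str, List[str]] = {
    "dissection": ["grasp", "cut"],
    "closure": ["suture"],
}


@dataclass
class Prediction:
    horizon: int  # look-ahead interval (temporal scale)
    phase: str    # coarse state
    step: str     # fine state, constrained by the phase


def is_consistent(phase: str, step: str) -> bool:
    """Assessment check: the step must belong to its encompassing phase."""
    return step in STEPS_OF_PHASE.get(phase, [])


def predict_multiscale(
    frame: int,
    horizons: List[int],
    generate: Callable[[int, int], int],
    predict_phase: Callable[[int], str],
    predict_step: Callable[[int, str], str],
) -> List[Prediction]:
    """Predict (phase, step) at each increasing horizon, reusing previews."""
    preview, previous_horizon = frame, 0
    results: List[Prediction] = []
    for horizon in horizons:
        # Incremental generation (assumed): advance the latest preview by the
        # gap to the next horizon instead of restarting from the input frame.
        preview = generate(preview, horizon - previous_horizon)
        previous_horizon = horizon

        phase = predict_phase(preview)
        step = predict_step(preview, phase)
        if not is_consistent(phase, step):
            # Simplistic stand-in for an assessment agent: fall back to the
            # first step that the predicted phase allows.
            step = STEPS_OF_PHASE[phase][0]
        results.append(Prediction(horizon, phase, step))
    return results


# Toy stand-ins for the generation and decision agents (assumptions only).
def toy_generate(state: int, delta: int) -> int:
    return state + delta


def toy_predict_phase(t: int) -> str:
    return "dissection" if t < 3 else "closure"


def toy_predict_step(t: int, phase: str) -> str:
    # Deliberately ignores the phase so the consistency check has work to do.
    if t < 2:
        return "grasp"
    return "cut" if t < 4 else "suture"


if __name__ == "__main__":
    for p in predict_multiscale(0, [1, 3, 4], toy_generate,
                                toy_predict_phase, toy_predict_step):
        print(p)
```

With horizons `[1, 3, 4]`, the toy step agent proposes `cut` at horizon 3 while the phase agent predicts `closure`; the consistency check rejects that pair and substitutes a step allowed by the phase. In a real system the fallback would presumably be another prediction cycle rather than a fixed choice.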

Subject: NeurIPS.2025 - Poster