
Total: 1

#1 MOSCATO: Predicting Multiple Object State Change Through Actions

Authors: Parnian Zameni, Yuhan Shen, Ehsan Elhamifar

We introduce MOSCATO: a new benchmark for predicting the evolving states of multiple objects through long procedural videos with multiple actions. While prior work in object state prediction has typically focused on a single object undergoing one or a few state changes, real-world tasks require tracking many objects whose states evolve over multiple actions. Given the high cost of gathering framewise object-state labels for many videos, we develop a weakly-supervised multiple object state prediction framework, which only uses action labels during training. Specifically, we propose a novel Pseudo-Label Acquisition (PLA) pipeline that integrates large language models, vision-language models, and action segment annotations to generate fine-grained, per-frame object-state pseudo-labels for training a Multiple Object State Prediction (MOSP) network. We further devise a State-Action Interaction (SAI) module that explicitly models the correlations between actions and object states, thereby improving MOSP. To facilitate comprehensive evaluation, we create the MOSCATO benchmark by augmenting three egocentric video datasets with framewise object-state annotations. Experiments show that our multi-stage pseudo-labeling approach and SAI module significantly boost performance over zero-shot VLM baselines and naive extensions of existing methods, underscoring the importance of holistic action-state modeling for fine-grained procedural video understanding.
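
To make the pseudo-labeling idea concrete, the sketch below illustrates one plausible way action-segment annotations could be turned into per-frame object-state pseudo-labels, in the spirit of the PLA pipeline described in the abstract. The `ACTION_STATE_MAP`, `ActionSegment`, and `propagate_pseudo_labels` names, and the action-to-state entries, are invented for illustration; the paper obtains such knowledge from large language models and vision-language models rather than a hand-written table.

```python
# Hypothetical sketch: derive per-frame object-state pseudo-labels from
# action segments plus an assumed action -> (object, pre-state, post-state)
# mapping. All names and values are illustrative, not the paper's method.

from dataclasses import dataclass

@dataclass
class ActionSegment:
    action: str  # action label, e.g. "crack egg"
    start: int   # first frame of the segment (inclusive)
    end: int     # last frame of the segment (inclusive)

# Assumed LLM-style knowledge: each action changes one object from a
# pre-state to a post-state.
ACTION_STATE_MAP = {
    "crack egg": ("egg", "whole", "cracked"),
    "whisk egg": ("egg", "cracked", "whisked"),
}

def propagate_pseudo_labels(segments, num_frames):
    """Return {object: [state per frame]} by propagating pre-states backward
    and post-states forward from each annotated action segment."""
    labels = {}
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.action not in ACTION_STATE_MAP:
            continue
        obj, pre, post = ACTION_STATE_MAP[seg.action]
        states = labels.setdefault(obj, [None] * num_frames)
        # Frames before the action keep the pre-state, unless an earlier
        # action on the same object already set them.
        for t in range(seg.start):
            if states[t] is None:
                states[t] = pre
        # Frames from the action onward take the post-state.
        for t in range(seg.start, num_frames):
            states[t] = post
    return labels

if __name__ == "__main__":
    segs = [ActionSegment("crack egg", 30, 45), ActionSegment("whisk egg", 60, 90)]
    egg = propagate_pseudo_labels(segs, num_frames=120)["egg"]
    print(egg[0], egg[50], egg[100])  # whole cracked whisked
```

In this toy version the states of every labeled object are defined at every frame, which is what would allow a Multiple Object State Prediction network to be trained framewise from action labels alone; the paper's actual pipeline is more elaborate and model-driven.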

Subject: ICCV.2025 - Poster