Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

Authors: Samuele Punzo, Niccolò Caselli, Ippokratis Pantelidis, Francesco Massafra, Salvatore Lo Sardo, Mohammadreza Salehi

We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.
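The probing setup and the frame-order control described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes synthetic features of shape (clips, frames, dim) standing in for the frozen activations of one layer, a simple order-sensitive readout (mean frame-to-frame difference), and a closed-form ridge probe swapped in for whatever probes the paper uses. All function names here are hypothetical.

```python
import numpy as np

def make_synthetic_features(n_clips, n_frames, dim, rng):
    """Stand-in for frozen features: each clip drifts along +v or -v over time."""
    labels = rng.integers(0, 2, size=n_clips)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    signs = 2.0 * labels - 1.0
    t = np.arange(n_frames, dtype=float)
    feats = signs[:, None, None] * t[None, :, None] * direction[None, None, :]
    feats += 0.1 * rng.normal(size=feats.shape)
    return feats, labels

def shuffle_frames(feats, rng):
    """Temporal control: permute the frame axis independently for every clip."""
    shuffled = np.empty_like(feats)
    for i in range(feats.shape[0]):
        shuffled[i] = feats[i, rng.permutation(feats.shape[1])]
    return shuffled

def temporal_readout(feats):
    """Order-sensitive pooling: mean frame-to-frame difference, shape (N, D)."""
    return np.diff(feats, axis=1).mean(axis=1)

def fit_ridge_probe(x, labels, lam=1e-3):
    """Closed-form ridge regression onto +/-1 targets, with a bias column."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    targets = 2.0 * labels - 1.0
    gram = xb.T @ xb + lam * np.eye(xb.shape[1])
    return np.linalg.solve(gram, xb.T @ targets)

def probe_accuracy(weights, x, labels):
    """Fraction of clips whose sign of the linear score matches the label."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    preds = (xb @ weights > 0).astype(int)
    return float((preds == labels).mean())

rng = np.random.default_rng(0)
feats, labels = make_synthetic_features(600, 8, 16, rng)
train_x, train_y = feats[:400], labels[:400]
test_x, test_y = feats[400:], labels[400:]

# The backbone stays frozen: only the probe weights are fitted.
weights = fit_ridge_probe(temporal_readout(train_x), train_y)

acc_ordered = probe_accuracy(weights, temporal_readout(test_x), test_y)
acc_shuffled = probe_accuracy(
    weights, temporal_readout(shuffle_frames(test_x, rng)), test_y
)
print(f"ordered frames:  {acc_ordered:.3f}")
print(f"shuffled frames: {acc_shuffled:.3f}")
```

In this toy setting the label is carried entirely by the direction of temporal drift, so shuffling frames removes the signal the probe relies on. A layerwise analysis would repeat the same fit once per layer, using that layer's features in place of `feats`. Note that an order-invariant readout such as mean pooling over frames would be unaffected by the shuffle, which is why the control is only informative for probes that model temporal dynamics.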

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

Published: 2026-06-08 15:40:32 UTC

2606.09646
