Video data inherently captures rich, dynamic contexts that reveal objects in varying poses, interactions, and state transitions, offering strong potential for unsupervised object representation learning. However, most prior representation learning methods rely on static image datasets like ImageNet, which lack temporal cues and provide only high-level semantic supervision. Meanwhile, existing natural video datasets are not ideal for learning object-centric representations due to limited object focus and class diversity. To explore unsupervised object representation learning grounded in object dynamics rather than static appearance alone, we introduce TrackVerse, a large-scale video dataset of 31.9 million object tracks spanning over 1,000 categories, each capturing the motion, appearance, and evolving states of an object over time. We further propose a variance-aware contrastive learning framework that adapts to data augmentations, encouraging the model to learn state-sensitive features. Extensive experiments demonstrate that representations learned from TrackVerse with variance-aware contrastive learning significantly outperform those learned from static image datasets and non-object-centric natural videos across multiple downstream tasks, including object/attribute recognition, action recognition, and video instance segmentation, highlighting the rich semantic and state content captured by TrackVerse features.