Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video

Authors: David Yifan Yao, Albert J. Zhai, Shenlong Wang

This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

Published: 2025-03-27 17:57:32 UTC

arXiv: 2503.21761
