WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation

Authors: Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, Mike Zheng Shou

Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.

Subject: Computer Vision and Pattern Recognition

Published: 2025-11-27 04:40:37 UTC

2511.22098
