Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce deterministic predictions that implicitly average over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaWorld, a generative VFM-based world model that efficiently generates diverse plausible futures. At the core of DeltaWorld is DeltaTok, a tokenizer that encodes the feature difference between consecutive frames into a single continuous "delta" token, reducing video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence. For example, this yields a 1,024x token reduction with 512x512 frames. Delta tokens enable efficient and effective multi-hypothesis training, where many diverse futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Project page: https://deltatok.github.io.
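The token-count arithmetic and the "only the best is supervised" objective described above can be illustrated with a minimal sketch. It is not the DeltaWorld implementation. It assumes a ViT-style VFM with 16-pixel patches (the abstract does not state a patch size, but this value is consistent with the 1,024x figure for 512x512 frames) and uses squared error as a stand-in distance. The function names `tokens_per_frame` and `best_of_many_loss` are hypothetical.

```python
def tokens_per_frame(height, width, patch_size=16):
    """Patch tokens a ViT-style encoder emits for one frame.

    patch_size=16 is an assumption; the abstract does not specify it.
    """
    return (height // patch_size) * (width // patch_size)


def squared_error(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_of_many_loss(hypotheses, target):
    """Return (loss, index) of the hypothesis closest to the target.

    Only the winning hypothesis contributes to the loss, so the other
    hypotheses are left free to cover different plausible futures.
    """
    errors = [squared_error(h, target) for h in hypotheses]
    best = min(range(len(errors)), key=errors.__getitem__)
    return errors[best], best


# One delta token per frame transition instead of a full patch grid.
reduction = tokens_per_frame(512, 512) // 1

# Three toy "delta token" hypotheses for the same next frame.
hypotheses = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2]]
target = [1.0, 1.0]
loss, best = best_of_many_loss(hypotheses, target)
```

In a real training loop the hypotheses would be produced in parallel by the world model and the gradient would flow only through the selected one. Here plain lists stand in for token vectors.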