Generating videos causally in an autoregressive manner is considered a promising path toward infinite video generation with flexible context. Prior autoregressive approaches typically rely on vector quantization to convert a video into a discrete-valued space, which can raise efficiency challenges when modeling long videos. In this work, we propose a novel approach that enables autoregressive video generation without vector quantization. Specifically, we reformulate video generation as an autoregressive modeling framework that integrates temporal \textit{frame-by-frame} prediction with spatial \textit{set-by-set} prediction. Unlike the raster-scan prediction of prior autoregressive models or the joint distribution modeling of fixed-length tokens in diffusion models, our approach maintains the causal property of GPT-style models for flexible in-context capabilities, while leveraging bidirectional modeling within individual frames for efficiency. With this approach, we train a novel video autoregressive model, termed \Ours. Our results demonstrate that \Ours surpasses prior autoregressive video models across data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity, \ie, 0.6B parameters. \Ours generalizes well to extended video durations and enables diverse zero-shot applications within a single unified model. Additionally, with a significantly lower training cost, \Ours outperforms state-of-the-art image diffusion models on text-to-image generation tasks. We will release all model weights and code to facilitate the reproduction of \Ours and further development.
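One way to read the combined \textit{frame-by-frame} and \textit{set-by-set} formulation (a sketch of the stated factorization, not an equation taken from the method section) is the following, where a video $x$ consists of frames $f_1, \dots, f_T$ and each frame $f_t$ is partitioned into token sets $S_t^{1}, \dots, S_t^{K}$:
\begin{equation*}
p(x) \;=\; \prod_{t=1}^{T} p\big(f_t \mid f_{<t}\big),
\qquad
p\big(f_t \mid f_{<t}\big) \;=\; \prod_{k=1}^{K} p\big(S_t^{k} \mid S_t^{<k},\, f_{<t}\big),
\end{equation*}
so that prediction remains causal across frames, while each conditional over a token set is modeled jointly with bidirectional attention within the frame.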