Current video training methods rely on fixed spatiotemporal sampling grids to extract a predetermined number of tokens, limiting adaptability to diverse computational budgets and resulting in suboptimal accuracy-computation trade-offs. This rigidity prevents high-performance models trained in resource-rich environments from being efficiently deployed on resource-constrained devices. We therefore introduce a novel paradigm for lossless adaptation across scenarios, enabling models to maintain optimal performance under high-resource conditions while seamlessly transferring to low-resource environments. Central to this is Token Optimization (TO), an adaptive inference framework that dynamically samples and selects the input token set to maximize input information under varied computational constraints. To support this, we propose Flux, an augmentation tool that enables flexible sampling grids and token selection. It integrates seamlessly into popular video training frameworks, significantly enhancing model robustness and adaptability at negligible additional cost. Applied to large-scale video pretraining, our method produces FluxViT, which achieves state-of-the-art performance across multiple tasks under standard costs. Remarkably, with only 1/4 of the tokens, FluxViT under TO matches prior state-of-the-art models across tasks, achieving nearly 90% computational savings. Code and models will be available at https://github.com/OpenGVLab/FluxViT.
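To make the idea of budget-constrained token selection concrete, the following is a minimal PyTorch sketch, not the paper's actual Flux/TO implementation: it tokenizes a clip into tubelets on a flexible grid (hypothetical `patch`/`tstride` parameters) and keeps only the top-`budget` tokens by a stand-in variance saliency, where the real method would use a learned or attention-based score.

```python
import torch


def select_tokens(video, patch=16, tstride=2, budget=256):
    """Illustrative budget-constrained token selection for a video clip.

    video: (C, T, H, W) float tensor.
    Tokens are non-overlapping tstride x patch x patch tubelets; the
    top-`budget` tokens by a toy variance-based saliency are kept.
    """
    c, t, h, w = video.shape
    # Crop so the clip divides evenly into tubelets (flexible sampling grid).
    t, h, w = t - t % tstride, h - h % patch, w - w % patch
    video = video[:, :t, :h, :w]
    # Unfold into (num_tokens, token_dim) tubelet tokens.
    tokens = (
        video.reshape(c, t // tstride, tstride, h // patch, patch, w // patch, patch)
        .permute(1, 3, 5, 0, 2, 4, 6)
        .reshape(-1, c * tstride * patch * patch)
    )
    # Toy saliency: per-token variance; keep the `budget` highest-scoring tokens.
    scores = tokens.var(dim=1)
    keep = scores.topk(min(budget, tokens.shape[0])).indices
    return tokens[keep], keep


# Example: a 16-frame 224x224 RGB clip reduced to a 256-token input set.
clip = torch.randn(3, 16, 224, 224)
kept_tokens, kept_idx = select_tokens(clip, budget=256)
print(kept_tokens.shape)  # torch.Size([256, 1536])
```

Varying `budget` (and the grid parameters) at inference time is what allows the same model to trade accuracy for computation across deployment scenarios.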