Large-scale vision foundation models such as DINOv2 achieve impressive performance by leveraging massive architectures and training datasets. The expense of large-scale pre-training puts such research out of reach for many, thereby limiting scientific progress. We therefore propose a novel pre-training strategy for DINOv2 that accelerates convergence and, as a by-product, strengthens robustness to common corruptions. Our approach combines a frequency-filtering curriculum, in which low frequencies are seen first, with a Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, our method reduces pre-training time by 1.6× (from 16.64 to 10.32 NVIDIA L40S days) and FLOPs by 2.25×, while matching the robustness of the DINOv2 baseline on corruption benchmarks (ImageNet-C) and maintaining competitive linear probing performance. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, and it opens the door to new exploration of data curricula and augmentation as means of improving the robustness of self-supervised models.
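To make the two ingredients concrete, the sketch below illustrates one plausible reading of the approach: a low-pass frequency-filtering curriculum whose cutoff opens over training (so low frequencies are seen first) and a Gaussian noise patching augmentation. This is a minimal illustration, not the paper's released code; the cutoff schedule, patch size, noise scale, and patching probability are assumptions chosen for clarity.

```python
# Hypothetical sketch (not the paper's implementation) of a low-pass
# frequency-filtering curriculum plus Gaussian noise patching, assuming
# PyTorch image tensors in (C, H, W) layout. All schedule and augmentation
# parameters below are illustrative assumptions.
import torch


def low_pass_filter(img: torch.Tensor, cutoff: float) -> torch.Tensor:
    """Keep only spatial frequencies below `cutoff` (fraction of Nyquist)."""
    c, h, w = img.shape
    fft = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy = torch.linspace(-1.0, 1.0, h).view(-1, 1)
    xx = torch.linspace(-1.0, 1.0, w).view(1, -1)
    mask = (yy**2 + xx**2).sqrt() <= cutoff  # circular low-pass mask
    filtered = torch.fft.ifft2(torch.fft.ifftshift(fft * mask, dim=(-2, -1)))
    return filtered.real


def curriculum_cutoff(step: int, total_steps: int,
                      start: float = 0.1, end: float = 1.0) -> float:
    """Linearly widen the low-pass cutoff so low frequencies are seen first."""
    t = min(step / max(total_steps, 1), 1.0)
    return start + t * (end - start)


def gaussian_noise_patch(img: torch.Tensor, patch: int = 16,
                         sigma: float = 0.5, p: float = 0.25) -> torch.Tensor:
    """Replace randomly selected image patches with Gaussian noise."""
    c, h, w = img.shape
    out = img.clone()
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if torch.rand(()) < p:
                out[:, y:y + patch, x:x + patch] = sigma * torch.randn(
                    c, min(patch, h - y), min(patch, w - x))
    return out


# Example: apply the curriculum filter, then noise patching, to one image.
img = torch.rand(3, 224, 224)
cut = curriculum_cutoff(step=1_000, total_steps=10_000)
aug = gaussian_noise_patch(low_pass_filter(img, cut))
```

In this reading, the curriculum saves compute because early training operates on heavily low-pass-filtered inputs, while the noise patching exposes the model to corrupted patches throughout training, which is consistent with the reported robustness on ImageNet-C.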