2602.18861

Total: 1

#1 Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation [PDF] [Copy] [Kimi2] [REL]

Authors: Shile Li, Markus Karmann, Onay Urfalioglu

We present a framework for end-to-end joint quantization of Vision Transformers trained on ImageNet for the purpose of image classification. Unlike prior post-training or block-wise reconstruction methods, we jointly optimize over the entire set of all layers and inter-block dependencies without any labeled data, scaling effectively with the number of samples and completing in just one hour on a single GPU for ViT-small. We achieve state-of-the-art W4A4 and W3A3 accuracies on ImageNet and, to the best of our knowledge, the first PTQ results that maintain strong accuracy on ViT, DeiT, and Swin-T models under extremely low-bit settings (W1.58A8), demonstrating the potential for efficient edge deployment. Furthermore, we introduce a data-free calibration strategy that synthesizes diverse, label-free samples using Stable Diffusion Turbo guided by learned multi-mode prompts. By encouraging diversity in both the learned prompt embeddings and the generated image features, our data-free approach achieves performance on par with real-data ImageNet calibration and surpasses simple text-prompt baselines such as "a <adjective> photo of <adjective> <cls>".

Subject: Computer Vision and Pattern Recognition

Publish: 2026-02-21 15:02:21 UTC