Visual Variational Autoencoder Prompt Tuning | Cool Papers

#1 Visual Variational Autoencoder Prompt Tuning

Authors: Xi Xiao, Yunbei Zhang, Yanshuh Li, Xingjian Li, Tianyang Wang, Jihun Hamm, Xiao Wang, Min Xu

Parameter-efficient fine-tuning (PEFT) has emerged as a crucial approach for adapting large vision transformers to downstream tasks without the prohibitive computational costs of full fine-tuning. While existing visual prompt tuning (VPT) methods have made significant strides, they predominantly rely on static, domain-specific prompts that fail to capture the rich visual diversity within individual instances. This paper introduces V$^2$APT (Visual Variational Autoencoder Prompt Tuning), a novel framework that generates dynamic, input-dependent prompts using a variational autoencoder architecture. By learning a latent representation of image-specific features and decoding them into customized prompts, V$^2$APT adapts to the unique visual characteristics of each input. Extensive experiments on FGVC, HTA, and VTAB-1k benchmarks demonstrate that our approach consistently outperforms state-of-the-art PEFT methods. Notably, V$^2$APT achieves +3.2% improvement over VPT-Deep on HTA, with an average performance gain of +2.0% across all three datasets.
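The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea (encode an image's features into a latent, then decode that latent into prompts), not the authors' implementation. The mean pooling, the single linear encoder and decoder layers, and every name here (`init_params`, `generate_prompts`, `prepend_prompts`, `W_mu`, `W_logvar`, `W_dec`) are assumptions made for illustration.

```python
import numpy as np

def init_params(dim, latent_dim, num_prompts, rng):
    """Hypothetical weights: two linear encoder heads and one linear decoder."""
    scale = 0.02
    return {
        "W_mu": rng.normal(0.0, scale, (dim, latent_dim)),
        "W_logvar": rng.normal(0.0, scale, (dim, latent_dim)),
        "W_dec": rng.normal(0.0, scale, (latent_dim, num_prompts * dim)),
    }

def generate_prompts(patch_tokens, params, num_prompts, rng):
    """Map one image's patch tokens (n_patches, dim) to prompts (num_prompts, dim)."""
    dim = patch_tokens.shape[1]
    pooled = patch_tokens.mean(axis=0)        # assumed image summary: mean pooling
    mu = pooled @ params["W_mu"]              # latent mean
    logvar = pooled @ params["W_logvar"]      # latent log-variance
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps       # reparameterization trick
    prompts = (z @ params["W_dec"]).reshape(num_prompts, dim)
    # Standard VAE regularizer: KL(N(mu, sigma^2) || N(0, I))
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    return prompts, kl

def prepend_prompts(patch_tokens, prompts):
    """Place the generated prompts in front of the patch tokens."""
    return np.concatenate([prompts, patch_tokens], axis=0)

rng = np.random.default_rng(0)
params = init_params(dim=64, latent_dim=16, num_prompts=5, rng=rng)
tokens = rng.standard_normal((196, 64))       # stand-in for ViT patch embeddings
prompts, kl = generate_prompts(tokens, params, num_prompts=5, rng=rng)
sequence = prepend_prompts(tokens, prompts)
print(sequence.shape)                         # (201, 64)
```

Because the prompts are computed from each image's own tokens, two different inputs receive different prompts, in contrast to a static prompt that is learned once and shared by every input.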

Subject: Computer Vision and Pattern Recognition

Publish: 2025-03-22 04:59:51 UTC

2503.17650
