Double-Filter: Efficient Fine-tuning of Pre-trained Vision-Language Models via Patch&Layer Filtering

Authors: Yaoqin He, Junchen Fu, Kaiwen Zheng, Songpei Xu, Fuhai Chen, Jie Li, Joemon Jose, Xuri Ge

In this paper, we present a novel approach, termed Double-Filter, to “slim down” the fine-tuning process of vision-language pre-trained (VLP) models by filtering redundancies in feature inputs and architectural components. We enhance the fine-tuning process in two ways. First, we develop a new patch selection method that filters image patches through background and foreground separation, followed by a refined patch selection process. Second, we design a genetic algorithm to eliminate redundant fine-grained architecture layers, improving the efficiency and effectiveness of the model. The former makes patch selection semantically more comprehensive, improving inference efficiency while preserving semantic representation. The latter’s fine-grained layer filter removes as much architectural redundancy as possible while mitigating the impact on performance. Experimental results demonstrate that the proposed Double-Filter achieves superior fine-tuning efficiency and maintains competitive performance compared with advanced efficient fine-tuning methods on three downstream tasks: VQA, NLVR, and retrieval. In addition, it is shown to be effective with both the METER and ViLT VLP models.
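As a rough illustration of the layer-filtering idea, the sketch below runs a generic genetic algorithm over binary keep/drop masks, one bit per layer. The abstract does not specify the encoding, fitness function, or genetic operators used in Double-Filter, so everything here is an assumption: the names `evolve_layer_mask` and `toy_fitness` are hypothetical, and the fitness is a toy stand-in for a real measurement of task accuracy against compute cost.

```python
import random


def evolve_layer_mask(num_layers, fitness, pop_size=20, generations=30,
                      mutation_rate=0.1, seed=0):
    """Search for a binary keep/drop mask over layers with a simple GA.

    Generic sketch, not the paper's algorithm: elitist selection,
    single-point crossover, and per-bit mutation are all assumed choices.
    """
    rng = random.Random(seed)
    # Each individual is a list of 0/1 flags: 1 keeps the layer, 0 drops it.
    population = [[rng.randint(0, 1) for _ in range(num_layers)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by fitness and keep the top half unchanged as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, num_layers - 1)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(mask):
    # Hypothetical stand-in for "accuracy minus cost": pretend only
    # even-indexed layers help accuracy, and every kept layer costs compute.
    accuracy_proxy = sum(m for i, m in enumerate(mask) if i % 2 == 0)
    cost = 0.3 * sum(mask)
    return accuracy_proxy - cost


best = evolve_layer_mask(num_layers=12, fitness=toy_fitness)
print(best, toy_fitness(best))
```

In a real setting, the fitness call would presumably fine-tune or evaluate the masked model on a validation set, which is far more expensive than this toy function; caching fitness values per mask would be a natural first optimization.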

Subject: ICML.2025 - Poster

tStRKJKZEI@OpenReview
