liang23c@interspeech_2023@ISCA

Total: 1

#1 Adapting Language-Audio Models as Few-Shot Audio Learners

Authors: Jinhua Liang; Xubo Liu; Haohe Liu; Huy Phan; Emmanouil Benetos; Mark D. Plumbley; Wenwu Wang

Contrastive language-audio pretraining (CLAP) has become a new paradigm for learning audio concepts from audio-text pairs. CLAP models have shown unprecedented performance as zero-shot classifiers on downstream tasks. To further adapt CLAP with domain-specific knowledge, a popular approach is to finetune its audio encoder on available labelled examples. However, this is challenging in low-shot scenarios, where the number of annotated examples is small relative to the model size. In this work, we introduce a Training-efficient (Treff) adapter that rapidly learns from a small set of examples while maintaining the capacity for zero-shot classification. First, we propose a cross-attention linear model (CALM) that maps a set of labelled examples and a test audio clip to test labels. Second, we find that initialising CALM as a cosine measurement improves our Treff adapter even without training. The Treff adapter outperforms metric-based methods in few-shot settings and yields results competitive with fully-supervised methods.
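The core idea of the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `calm_logits`, the temperature parameter `tau`, and the one-hot label aggregation are illustrative assumptions. It shows a cross-attention step in which a test audio embedding (query) attends over labelled support embeddings (keys) and aggregates their labels (values); L2-normalising the embeddings makes the untrained attention scores reduce to cosine similarity, mirroring the training-free cosine initialisation the abstract describes.

```python
import numpy as np

def calm_logits(test_emb, support_embs, support_labels, tau=1.0):
    """Hypothetical sketch of a cross-attention linear model (CALM).

    test_emb: (d,) embedding of the test audio clip (query).
    support_embs: (n, d) embeddings of the labelled examples (keys).
    support_labels: (n, c) one-hot labels of those examples (values).
    Returns (c,) soft class scores for the test clip.
    """
    # L2-normalise so the raw attention scores are cosine similarities,
    # i.e. the training-free initialisation mentioned in the abstract.
    q = test_emb / np.linalg.norm(test_emb)
    K = support_embs / np.linalg.norm(support_embs, axis=1, keepdims=True)

    scores = (K @ q) / tau                # cosine similarity per support item
    attn = np.exp(scores - scores.max())  # numerically stable softmax
    attn /= attn.sum()

    return attn @ support_labels          # attention-weighted label mixture

# Toy usage: two classes with orthogonal prototype embeddings.
support = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
labels = np.eye(2)
scores = calm_logits(np.array([0.9, 0.1, 0.0]), support, labels)
print(scores.argmax())  # class 0, the nearer support example
```

In a trained Treff adapter the query/key projections would be learned from the few-shot examples; with no training at all, the cosine form above already gives a usable nearest-prototype-style classifier.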