MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models

Authors: Vijay Govindarajan, Pratik Patel, Sahil Tripathi, Md Azizul Hoque, Gautam Siddharth Kashyap

Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose a zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. Performance depends heavily on the audio-text matching model and on keyword selection: the best results are achieved with a single-keyword prompt, and performance drops by 50% when no keyword list is used.
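To make the contrast with greedy decoding concrete, here is a minimal sketch of one audio-guided token-selection step. It is not the authors' implementation. It assumes a simplified MAGIC-style score in which each top-k candidate's LLM probability is added to a weighted softmax over audio-text similarity scores. The function names, the weight `beta`, and all numeric values are hypothetical, chosen only for illustration. In a real system the similarities would come from the audio CLIP model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_step(candidates, lm_probs, audio_sims, beta=2.0):
    """Pick one token from the LLM's top-k candidates.

    candidates: candidate token strings (top-k from the LLM)
    lm_probs:   LLM probability of each candidate
    audio_sims: audio-text similarity of the caption extended by each
                candidate (assumed to come from an audio CLIP model)
    beta:       hypothetical weight on the audio-matching term
    """
    audio_weights = softmax(audio_sims)
    scores = [p + beta * w for p, w in zip(lm_probs, audio_weights)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Toy values, invented for illustration only.
candidates = ["music", "barking", "rain"]
lm_probs = [0.5, 0.3, 0.2]
audio_sims = [0.1, 0.9, 0.2]

# Greedy decoding looks at the LLM probability alone.
greedy_token = candidates[lm_probs.index(max(lm_probs))]

# The audio-guided step also rewards agreement with the audio clip.
guided_token, guided_scores = audio_guided_step(candidates, lm_probs, audio_sims)
```

With these toy numbers, greedy decoding picks the token the LLM finds most likely, while the audio-guided step prefers the candidate with the strongest audio-text match. A larger `beta` shifts the choice further toward the audio model.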

Subject: Computation and Language

Publish: 2025-09-16 02:36:00 UTC

2509.12591
