

Token Pruning Meets Audio: Investigating Unique Behaviors in Vision Transformer-Based Audio Classification

Authors: Taehan Lee, Woojin Lee, Hyukjun Lee

Vision Transformers (ViTs) have achieved state-of-the-art performance across various computer vision tasks. To reduce their high computational cost, token pruning has been proposed to selectively remove tokens that are not crucial. While effective in vision tasks, where non-object regions can be discarded, applying this technique to audio tasks presents unique challenges: in audio processing, distinguishing relevant from non-relevant regions is less straightforward. In this study, we applied token pruning to a ViT-based audio classification model operating on Mel-spectrograms and analyzed the trade-offs between model performance and computational cost. We show that the AudioMAE-TopK model can reduce MAC operations by 2× with less than a 1% decrease in accuracy on both speech command recognition and environmental sound classification. Notably, while many tokens from signal (high-intensity) regions were pruned, tokens from background (low-intensity) regions were frequently retained, indicating the model's reliance on these regions. In an ablation study, forcing the model to focus only on signal (high-intensity) regions lowered accuracy, suggesting that background (low-intensity) regions contain unique, irreplaceable information for AudioMAE. In addition, we find that when token pruning is applied, the supervised pre-trained AST model emphasizes tokens from signal regions more than AudioMAE does.
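As a rough illustration of the Top-K pruning idea described above, the sketch below keeps only the k highest-scoring patch tokens from a sequence. The scoring function (embedding L2 norm), the token count, and the helper name `topk_token_prune` are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def topk_token_prune(tokens, scores, k):
    # Keep the k highest-scoring tokens, preserving their original order.
    # tokens: (N, D) array of patch embeddings; scores: (N,) importance scores.
    keep = np.sort(np.argpartition(scores, -k)[-k:])
    return tokens[keep], keep

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))      # 8 hypothetical spectrogram patch tokens
scores = np.linalg.norm(tokens, axis=1)   # toy importance score (an assumption)
kept, idx = topk_token_prune(tokens, scores, k=4)  # prune half the tokens (2x fewer)
```

Halving the token count roughly halves the MACs of subsequent transformer layers, which is the trade-off the study quantifies.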

Subject: ICLR.2025 - Poster