liang24@interspeech_2024@ISCA


#1 Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation

Authors: Dawei Liang, Alice Zhang, David Harwath, Edison Thomaz

Acoustic scene and event classification is gaining traction in mobile health and wearable applications. Traditionally, relevant research has focused on high-quality inputs (sampling rates >= 16 kHz). However, lower sampling rates (e.g., 1 kHz - 2 kHz) offer enhanced privacy and reduced power consumption, both crucial for continuous mobile use. This study introduces efficient methods for optimizing pre-trained audio neural networks (PANNs) targeting low-quality audio, employing Born-Again self-distillation (BASD) and a cross-sampling-rate self-distillation (CSSD) strategy. Testing three PANNs on diverse mobile datasets reveals that both strategies boost model inference performance, yielding an absolute accuracy/F1 gain of 1% to 6% over a baseline without distillation, while sampling at very low rates (1 kHz - 2 kHz). Notably, CSSD shows greater benefits, suggesting that models trained on high-quality audio adapt well to lower resolutions, despite the shift in input quality.
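The abstract does not give implementation details, but the cross-sampling-rate distillation idea can be sketched as follows: a teacher network sees full-rate audio while the student sees a decimated copy, and the student is trained to match the teacher's temperature-softened class distribution. This is a minimal NumPy sketch under assumed conventions (standard KL-based distillation with a T^2 scale, naive 16 kHz -> 2 kHz decimation); function names and the temperature value are illustrative, not from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as is conventional in knowledge distillation (assumed, not from the paper)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float(np.mean(kl) * T * T)

def decimate(wave_16k, factor=8):
    """Naive decimation, e.g. 16 kHz -> 2 kHz with factor=8.
    A real pipeline would apply an anti-aliasing low-pass filter first."""
    return np.asarray(wave_16k)[::factor]
```

In a CSSD-style setup, `teacher_logits` would come from the model run on the original waveform and `student_logits` from the same architecture run on `decimate(wave)`; identical logits give zero loss, and the student is penalized as its distribution drifts from the teacher's.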