2410.18850

We Augmented Whisper With kNN and You Won't Believe What Came Next

Authors: Maya K. Nachesa ; Vlad Niculae

Speech recognition performance varies by language, domain, and speaker characteristics such as accent, and fine-tuning a model on any of these categories may lead to catastrophic forgetting. $k$-nearest-neighbor search ($k$NN), first proposed for neural sequence decoders in natural language generation (NLG) and machine translation (MT), is a non-parametric method that adapts instead by building an external datastore that is searched at inference time, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
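The datastore idea in the abstract can be illustrated with a minimal sketch in the style of the $k$NN language-modeling and $k$NN-MT work it refers to. This is not the authors' code: the function names, the tiny 2-dimensional "hidden states", the temperature, and the interpolation weight `lam` are all assumptions chosen for illustration, and the abstract does not confirm that the paper uses exactly this formulation. Keys are decoder hidden states, values are the tokens that followed them, and the retrieved distribution is mixed with the model's own prediction.

```python
import math
from collections import defaultdict


def build_datastore(hidden_states, next_tokens):
    """Pair each decoder hidden state (key) with the token that followed it (value)."""
    return list(zip(hidden_states, next_tokens))


def knn_distribution(datastore, query, k, temperature=1.0):
    """Distribution over tokens from the k nearest keys (non-empty datastore assumed).

    Each neighbor is weighted by a softmax over negative squared L2 distance,
    and the weights of neighbors that share a token are summed.
    """
    scored = sorted(
        (
            (sum((q - x) ** 2 for q, x in zip(query, key)), token)
            for key, token in datastore
        ),
        key=lambda pair: pair[0],
    )[:k]
    d_min = scored[0][0]  # subtract the smallest distance for numerical stability
    weights = [(math.exp(-(d - d_min) / temperature), tok) for d, tok in scored]
    total = sum(w for w, _ in weights)
    probs = defaultdict(float)
    for w, tok in weights:
        probs[tok] += w / total
    return dict(probs)


def interpolate(p_model, p_knn, lam):
    """Mix the two distributions: lam * p_kNN + (1 - lam) * p_model."""
    vocab = set(p_model) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_model.get(t, 0.0)
        for t in vocab
    }


# Toy usage with made-up values: two stored states near the query predict "cat".
datastore = build_datastore(
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]],
    ["cat", "cat", "dog"],
)
p_knn = knn_distribution(datastore, query=[0.0, 0.1], k=2)
mixed = interpolate({"cat": 0.4, "dog": 0.6}, p_knn, lam=0.5)
print(mixed)
```

In this toy case both retrieved neighbors vote for "cat", so the mixed distribution moves from the model's 0.4 toward about 0.7 for "cat" without any weight update. A real system would replace the linear scan with an approximate nearest-neighbor index and take the keys from the speech model's decoder.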

Subjects: Computation and Language ; Sound ; Audio and Speech Processing

Publish: 2024-10-24 15:32:52 UTC