kim25d@interspeech_2025@ISCA

Fully End-to-end Streaming Open-vocabulary Keyword Spotting with W-CTC Forced Alignment

Authors: Dohyun Kim, Jiwook Hwang

In open-vocabulary keyword spotting, an acoustic encoder pre-trained with Connectionist Temporal Classification (CTC) loss is typically used to train a text encoder by aligning the audio embedding space with the text embedding space. In previous work, word-aligned datasets were created with forced alignment tools such as the Montreal Forced Aligner (MFA) to train the text encoder and verifier models. In this paper, we propose a new training pipeline for open-vocabulary keyword spotting using the W-CTC forced alignment algorithm, a simple modification of the practical CTC algorithm. Our approach eliminates the need to create word-aligned datasets, operates in a fully end-to-end manner, and demonstrates superior performance on the LibriPhrase hard dataset.
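For readers unfamiliar with CTC forced alignment, the following is a minimal sketch of the standard Viterbi variant that W-CTC is described as modifying. It is not the authors' W-CTC algorithm, and the W-CTC modification itself is not reproduced here. The function name `ctc_forced_align`, the blank index 0, and the toy posteriors are all assumptions made for illustration.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Standard CTC Viterbi forced alignment (illustrative, not W-CTC).

    log_probs: T x V nested list of per-frame log probabilities.
    targets:   list of label ids without blanks.
    Returns a list of length T with the aligned label id (or blank) per frame.
    """
    T = len(log_probs)
    # Extended label sequence: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for y in targets:
        ext += [y, blank]
    S = len(ext)

    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]

    # A path may start in the leading blank or in the first label.
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Option 1: stay in the same state.
            best, arg = score[t - 1][s], s
            # Option 2: advance from the previous state.
            if s >= 1 and score[t - 1][s - 1] > best:
                best, arg = score[t - 1][s - 1], s - 1
            # Option 3: skip a blank between two different labels.
            if (s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]
                    and score[t - 1][s - 2] > best):
                best, arg = score[t - 1][s - 2], s - 2
            score[t][s] = best + log_probs[t][ext[s]]
            back[t][s] = arg

    # A valid path ends in the final blank or the final label.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("targets cannot be aligned to this many frames")

    # Backtrack to recover the per-frame labels.
    path = [blank] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path


# Toy posteriors over {0: blank, 1: "a", 2: "b"} for four frames (made up).
probs = [
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
]
toy_log_probs = [[math.log(p) for p in frame] for frame in probs]
alignment = ctc_forced_align(toy_log_probs, [1, 2])
print(alignment)  # [1, 0, 2, 2]
```

In this toy case the first frame is assigned to label 1, the second to blank, and the last two to label 2, which yields frame-level boundaries for each label without a separate aligner.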

Subject: INTERSPEECH.2025 - Speech Detection