In open-vocabulary keyword spotting, an acoustic encoder pre-trained with the Connectionist Temporal Classification (CTC) loss is typically used to train a text encoder by aligning the audio embedding space with the text embedding space. In previous work, word-aligned datasets were created with forced alignment tools such as the Montreal Forced Aligner (MFA) in order to train the text encoder and verifier models. In this paper, we propose a new training pipeline for open-vocabulary keyword spotting based on the W-CTC forced alignment algorithm, a simple modification of the practical CTC algorithm. Our approach eliminates the need to create word-aligned datasets, operates in a fully end-to-end manner, and achieves superior performance on the LibriPhrase Hard dataset.
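To illustrate why a wildcard-style alignment can remove the need for a word-aligned dataset, the following is a minimal sketch of CTC Viterbi alignment with a free start and a free end, so that a keyword can be located inside a longer utterance directly from frame-level CTC log-probabilities. It is an illustration under stated assumptions and not the implementation used in this paper: the function name `wildcard_ctc_align`, the zero-cost treatment of frames outside the keyword, and the toy posteriors are all hypothetical.

```python
import math

NEG_INF = float("-inf")


def wildcard_ctc_align(log_probs, target, blank=0):
    """Best Viterbi span of `target` inside a longer utterance.

    log_probs: T x V nested list of per-frame CTC log-probabilities.
    target:    non-empty list of token ids for the keyword.
    Returns (score, start_frame, end_frame), frames inclusive.
    Assumption: frames outside the span are absorbed at zero cost,
    which mimics a wildcard that matches anything.
    """
    # Standard CTC extension: blank, t1, blank, t2, ..., blank
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    n_states = len(ext)

    # prev[s] = (score, start_frame) of the best path ending in state s
    prev = [(NEG_INF, -1)] * n_states
    best = (NEG_INF, -1, -1)

    for t, frame in enumerate(log_probs):
        cur = []
        for s in range(n_states):
            cands = [prev[s]]                      # stay in the same state
            if s >= 1:
                cands.append(prev[s - 1])          # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(prev[s - 2])          # skip the optional blank
            if s <= 1:
                cands.append((0.0, t))             # free start at frame t
            score, start = max(cands, key=lambda c: c[0])
            if score == NEG_INF:
                cur.append((NEG_INF, -1))
            else:
                cur.append((score + frame[ext[s]], start))
        prev = cur
        # Free end: the path may stop in the last token or the final blank
        for s in (n_states - 1, n_states - 2):
            if prev[s][0] > best[0]:
                best = (prev[s][0], prev[s][1], t)
    return best


def toy_posteriors(labels, vocab_size=3, peak=0.9):
    """Hypothetical per-frame log-probabilities peaked on `labels`."""
    rest = (1.0 - peak) / (vocab_size - 1)
    return [[math.log(peak if v == lab else rest) for v in range(vocab_size)]
            for lab in labels]


# Six frames: blank, blank, token 1, token 2, blank, blank
lp = toy_posteriors([0, 0, 1, 2, 0, 0])
score, start, end = wildcard_ctc_align(lp, [1, 2])
print(start, end)  # 2 3
```

Because frames outside the span cost nothing in this sketch, the highest-scoring path tends to be the shortest one that covers the keyword. A training pipeline would more plausibly use the summed (forward) probabilities than a single best path, and the choice made in this paper may differ from the one shown here.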