vitale24@interspeech_2024@ISCA


Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers’ ability to model hesitation phenomena

Authors: Vincenzo Norman Vitale ; Loredana Schettino ; Francesco Cutugno

Modern automatic speech recognition systems can achieve remarkable performance. However, they usually neglect characteristic speech phenomena such as fillers ( ) or segmental prolongations (the ), which are still treated only as disruptions to be detected and removed, despite their acknowledged regularity and procedural value. This study investigates the ability of state-of-the-art systems based on end-to-end models (E2E-ASRs) to model distinctive features of hesitation phenomena. Two types of pre-trained systems with the same Conformer-based encoder architecture but different decoders are evaluated: one with a Connectionist Temporal Classification (CTC) decoder and one with a Transducer decoder. The ability of E2E-ASRs to model the acoustic information tied to such phenomena can be exploited rather than disregarded as a source of noise, which would not only improve transcription and support linguistic annotation processes, but also deepen our understanding of how these systems work.