ardaillon22@interspeech_2022@ISCA

Total: 1

#1 Voicing decision based on phonemes classification and spectral moments for whisper-to-speech conversion

Authors: Luc Ardaillon; Nathalie Henrich; Olivier Perrotin

Cordectomized or laryngectomized patients recover the ability to speak thanks to devices able to produce a natural-sounding voice source in real time. However, constant voicing can impair the naturalness and intelligibility of reconstructed speech. Voicing decision, which consists in identifying whether an uttered phone should be voiced or not, is investigated here as an automatic process in the context of whisper-to-speech (W2S) conversion systems. Whereas state-of-the-art approaches apply DNN techniques to high-dimensional acoustic features, we seek a low-resource alternative that provides a perceptually meaningful mapping between acoustic features and voicing decision, suitable for real-time applications. Our method first classifies whisper signal frames into phoneme classes based on their spectral centroid and spread, and then discriminates voiced phonemes from their unvoiced counterparts using class-dependent spectral centroid thresholds. We compared our method to a simpler approach using a single centroid threshold on several databases of annotated whispers, in both single-speaker and multi-speaker training setups. While both approaches reach a voicing accuracy higher than 91%, the proposed method avoids some systematic voicing decision errors, which may allow users to learn to adapt their speech in real time to compensate for the remaining voicing errors.
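The two-stage decision described in the abstract lends itself to a compact sketch. The snippet below is a rough illustration, not the authors' implementation: it computes the first two spectral moments (centroid and spread) of a whispered frame with NumPy, looks up a coarse phoneme class from those moments, and then applies a class-dependent centroid threshold to decide voicing. The class boundaries, threshold values, and the `spectral_moments`/`voicing_decision` helpers are illustrative assumptions, not values or code from the paper.

```python
# Minimal sketch of a moments-based voicing decision, assuming hand-picked
# class regions and thresholds; a real system would fit these on annotated whispers.
import numpy as np

def spectral_moments(frame, sample_rate):
    """Return spectral centroid and spread (Hz) of one signal frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    power = spectrum ** 2
    total = power.sum() + 1e-12                            # guard against silent frames
    centroid = (freqs * power).sum() / total               # first spectral moment
    spread = np.sqrt(((freqs - centroid) ** 2 * power).sum() / total)  # second moment
    return centroid, spread

# Hypothetical class regions in the (centroid, spread) plane and per-class
# centroid thresholds -- placeholder values, not taken from the paper.
PHONEME_CLASSES = [
    # (name, centroid range in Hz, spread range in Hz, voicing threshold in Hz)
    ("vowel-like", (0.0, 1500.0),    (0.0, 2000.0),    1000.0),
    ("nasal-like", (0.0, 1500.0),    (2000.0, np.inf), 1200.0),
    ("fricative",  (1500.0, np.inf), (0.0, np.inf),    4000.0),
]

def voicing_decision(frame, sample_rate):
    """Classify the frame, then decide voicing from its class-dependent threshold."""
    centroid, spread = spectral_moments(frame, sample_rate)
    for name, (c_lo, c_hi), (s_lo, s_hi), threshold in PHONEME_CLASSES:
        if c_lo <= centroid < c_hi and s_lo <= spread < s_hi:
            # Voiced if the centroid falls below the threshold of its class.
            return name, centroid < threshold
    return "unknown", False

if __name__ == "__main__":
    sr = 16000
    t = np.arange(512) / sr
    # Synthetic low-centroid frame (tone plus weak noise) as a stand-in for a whispered vowel.
    frame = np.sin(2 * np.pi * 300 * t) + 0.1 * np.random.randn(512)
    print(voicing_decision(frame, sr))   # e.g. ('vowel-like', True)
```

Because each frame only requires one FFT and a table lookup, such a scheme stays cheap enough for the real-time, low-resource setting the abstract targets, in contrast to DNN models operating on high-dimensional features.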