Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

Authors: Satwinder Singh, Qianli Wang, Zihan Zhong, Clarion Mendes, Hasegawa-Johnson, Waleed Abdulla, Seyed Reza Shahamiri

Automatic speech recognition remains unreliable for dysarthric speech due to data scarcity and high inter-speaker variability. While synthetic data can address these gaps, traditional methods often require extensive speaker-specific data, reintroducing the collection bottleneck. We investigate zero-shot voice cloning as a low-burden augmentation strategy, using Higgs Audio V2 to clone speakers in the TORGO dataset. We fine-tune (FT) Whisper-medium on cloned, real, and hybrid data and evaluate on held-out real speech. Compared with the zero-shot baseline (31.62% WER), Clone FT achieves a competitive 26.00% WER, close to the 24.44% and 25.12% obtained with Real FT and Hybrid FT, respectively. Notably, Clone FT and Hybrid FT outperform Real FT for moderate-severe speakers. In cross-corpus evaluation on SAP-1102, Clone FT achieves the best results (11.45% relative improvement). These results suggest that zero-shot cloning provides scalable training data that circumvents the costly data collection bottleneck.
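To put the reported TORGO numbers on a common scale, the short sketch below computes the relative WER reduction of each fine-tuned setting over the zero-shot baseline. The WER values are taken from the abstract. The function name `relative_wer_reduction` and the formula (baseline minus system, divided by baseline) are assumptions for illustration; the abstract does not state how its "relative" figure is defined, and it gives no absolute cross-corpus WERs, so the SAP-1102 result is not reproduced here.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - system) / baseline.

    Assumed definition; the abstract does not spell out its formula.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# WER values (%) as reported in the abstract for held-out real TORGO speech.
ZERO_SHOT_WER = 31.62
FT_WER = {"Clone FT": 26.00, "Real FT": 24.44, "Hybrid FT": 25.12}

# Relative reduction of each fine-tuned setting over the zero-shot baseline.
reductions = {
    name: relative_wer_reduction(ZERO_SHOT_WER, wer)
    for name, wer in FT_WER.items()
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.2f}% relative WER reduction vs. zero-shot")
```

Under this assumed definition, Clone FT recovers most of the gain that Real FT obtains over the zero-shot baseline, which is consistent with the abstract's description of the result as competitive.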

Subjects: Audio and Speech Processing, Machine Learning

Publish: 2026-06-18 05:55:58 UTC

2606.19823
