2505.21356

Total: 1

#1 Towards Robust Automated Perceptual Voice Quality Assessment with Deep Learning [PDF1] [Copy] [Kimi1] [REL]

Authors: Whenty Ariyanti, Kuan-Yu Chen, Sabato Marco Siniscalchi, Hsin-Min Wang, Yu Tsao

Objective: Perceptual voice quality assessment plays a critical role in diagnosing and monitoring voice disorders by providing standardized evaluation of vocal function. Traditionally, this process relies on expert raters utilizing standard scales, such as the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS). However, these metrics are inherently subjective and susceptible to inter-rater variability, motivating the need for automated and objective assessment methods. Methods: We propose Voice Quality Assessment Network (VOQANet), a deep learning-based framework with an attention mechanism that leverages a Speech Foundation Model (SFM) to capture high-level acoustic and prosodic information from raw speech. To enhance robustness and interpretability, we present VOQANet+, which integrates handcrafted acoustic features such as jitter, shimmer, and harmonics-to-noise ratio (HNR) with SFM embeddings. Results: Sentence-based input yields stronger performance than vowel-based input, especially at the patient level. VOQANet consistently outperforms baseline methods in RMSE and PCC, while VOQANet+ performs even better and maintains robustness under noisy conditions. Conclusion: Combining SFM embeddings with domain-informed acoustic features improves interpretability and resilience. Significance: VOQANet+ shows strong potential for deployment in real-world and telehealth settings, addressing the limitations of subjective perceptual assessments with an interpretable and noise-resilient solution.

Subjects: Sound , Machine Learning , Audio and Speech Processing

Publish: 2025-05-27 15:48:17 UTC