Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens

Authors: Haici Yang, Jiaqi Su, Minje Kim, Zeyu Jin

We present a high-fidelity generative speech enhancement model, Genhancer, which generates clean speech as discrete codec tokens while conditioning on the input speech features. Discrete codec tokens offer an efficient latent domain in place of the conventional time or time-frequency domain of signals, which enables complex modeling of speech and lets generative modeling enforce speaker consistency and content continuity. We provide insights into the best-fit generation scheme for enhancement among parallel prediction, auto-regression, and masking, and demonstrate the benefits of conditioning on both pre-trained and jointly learned speech features. Subjective and objective tests show that Genhancer significantly improves audio quality and speaker-identity retention over the SOTA baselines, both conventional and generative, while preserving content accuracy. Audio samples and supplementary materials are available at https://minjekim.com/research-projects/genhancer

Subject: INTERSPEECH.2024 - Speech Synthesis

yang24h@interspeech_2024@ISCA
