kumar25c@interspeech_2025@ISCA

ArticulateX: End-to-End Monolingual Speech Translation in Articulator Space

Authors: Vishal Kumar, Vinayak Abrol

We present ArticulateX, the first non-autoregressive direct speech-to-speech translation (S2ST) model that operates through an articulatory latent space, offering an efficient alternative to existing cascaded models. It consists of a direct speech-to-articulator encoder, a latent articulator-to-MelSpectrogram mapper, and a vocoder for high-fidelity speech synthesis. By leveraging articulatory representations, which are inherently language-agnostic, our model effectively captures speech dynamics, preserving speaker identity, prosody, and expressiveness across languages. Unlike prior autoregressive models, ArticulateX eliminates the need for intermediate text, discrete units, or complex self-supervised objectives, enabling faster inference, stable training, and improved translation quality. We demonstrate the efficacy of the proposed model on fr-en and de-en speech-to-speech translation using the CVSS dataset, achieving BLEU scores better than or comparable to those of existing models.
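The three-stage structure named in the abstract (encoder, mapper, vocoder) can be pictured with a minimal sketch. This is not the authors' implementation: every function name, dimension, and weight below is a hypothetical placeholder, the "networks" are fixed linear maps, and the toy stages preserve the frame count, which a real translation model would not necessarily do. It only illustrates how the stages compose and that each output frame is computed without feeding back earlier outputs.

```python
from typing import Callable, List

Frames = List[List[float]]

# All sizes below are hypothetical placeholders, not values from the paper.
SRC_DIM, ART_DIM, MEL_DIM, HOP = 8, 4, 6, 3


def make_frame_map(in_dim: int, out_dim: int) -> Callable[[Frames], Frames]:
    """Stand-in for a learned network: a fixed linear map applied to each frame."""
    weights = [[((i + j) % 3 - 1) * 0.5 for i in range(in_dim)] for j in range(out_dim)]

    def apply(frames: Frames) -> Frames:
        # Each output frame depends only on the input, never on previously
        # generated outputs: the sense of "non-autoregressive" used here.
        return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
                for frame in frames]

    return apply


def toy_vocoder(mel: Frames) -> List[float]:
    """Stand-in vocoder: emit HOP samples per mel frame (the frame mean, repeated)."""
    samples: List[float] = []
    for frame in mel:
        samples.extend([sum(frame) / len(frame)] * HOP)
    return samples


def translate(source: Frames, encoder, mapper, vocoder) -> List[float]:
    """Compose the three stages: speech -> articulators -> mel -> waveform."""
    articulators = encoder(source)   # speech-to-articulator encoder
    mel = mapper(articulators)       # articulator-to-mel-spectrogram mapper
    return vocoder(mel)              # waveform synthesis


encoder = make_frame_map(SRC_DIM, ART_DIM)
mapper = make_frame_map(ART_DIM, MEL_DIM)
source = [[0.1 * (t + 1)] * SRC_DIM for t in range(5)]  # 5 dummy source frames
waveform = translate(source, encoder, mapper, toy_vocoder)
```

Note that no text or discrete-unit representation appears between the stages; the only intermediate values are the continuous articulatory and mel frames.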

Subject: INTERSPEECH.2025 - Language and Multimodal