We present ArticulateX, the first non-autoregressive direct speech-to-speech translation (S2ST) model that operates through an articulatory latent space, offering an efficient alternative to existing cascaded models. It consists of a direct speech-to-articulator encoder, a latent articulator-to-mel-spectrogram mapper, and a vocoder for high-fidelity speech synthesis. Because articulatory representations are inherently language-agnostic, the model effectively captures speech dynamics, preserving speaker identity, prosody, and expressiveness across languages. Unlike prior autoregressive models, ArticulateX eliminates the need for intermediate text, discrete units, or complex self-supervised objectives, enabling faster inference, more stable training, and improved translation quality. We demonstrate the efficacy of the proposed model on French-to-English (fr-en) and German-to-English (de-en) speech-to-speech translation using the CVSS dataset, achieving BLEU scores better than or comparable to those of existing models.
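To make the three-stage pipeline concrete, the following is a minimal PyTorch sketch of the data flow described above (speech-to-articulator encoder, articulator-to-mel-spectrogram mapper, vocoder). All module structures, dimensions, and names here (`ArticulateXSketch`, `n_articulators`, `d_model`) are hypothetical placeholders chosen for illustration, not the authors' actual architecture; it only shows how a non-autoregressive pipeline through an articulatory latent space could be wired.

```python
import torch
import torch.nn as nn

class ArticulateXSketch(nn.Module):
    """Illustrative three-stage S2ST pipeline; not the paper's architecture."""

    def __init__(self, n_mels: int = 80, n_articulators: int = 12, d_model: int = 256):
        super().__init__()
        # 1) Speech-to-articulator encoder: source-language mel frames are
        #    mapped into language-agnostic articulatory trajectories.
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, d_model), nn.ReLU(),
            nn.Linear(d_model, n_articulators),
        )
        # 2) Articulator-to-mel-spectrogram mapper: articulatory trajectories
        #    are decoded into target-language mel frames.
        self.mapper = nn.Sequential(
            nn.Linear(n_articulators, d_model), nn.ReLU(),
            nn.Linear(d_model, n_mels),
        )
        # 3) A neural vocoder would synthesize the waveform from the predicted
        #    mel spectrogram; it is stubbed out in this sketch.

    def forward(self, src_mels: torch.Tensor) -> torch.Tensor:
        # Non-autoregressive: all frames are predicted in one parallel pass,
        # with no step-by-step decoding loop over output tokens.
        articulators = self.encoder(src_mels)  # (batch, time, n_articulators)
        tgt_mels = self.mapper(articulators)   # (batch, time, n_mels)
        return tgt_mels

model = ArticulateXSketch()
src = torch.randn(2, 120, 80)  # 2 utterances, 120 mel frames, 80 mel bins
print(model(src).shape)        # torch.Size([2, 120, 80])
```

The key property the sketch illustrates is that the articulatory latent sits between the two modules as a shared, language-independent interface, and that the whole mapping is a single feed-forward pass, which is what enables the faster inference claimed over autoregressive baselines.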