High-Fidelity Simultaneous Speech-To-Speech Translation

Authors: Tom Labiausse, Laurent Mazaré, Edouard Grave, Alexandre Défossez, Neil Zeghidour

We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We further address the fundamental challenge of simultaneous interpretation, which, unlike its consecutive counterpart (where one waits for the end of the source utterance before starting to translate), adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity, and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples at *huggingface.co/spaces/kyutai/hibiki-samples* as well as models and inference code at *github.com/kyutai-labs/hibiki*.
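
The per-word delay idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes one plausible reading, in which each target word is aligned to the source prefix that gives the largest jump in its log-probability under a text translation model, and delays are then forced to be non-decreasing. The function name `align_delays` and the toy score table are hypothetical; in practice the scores would come from an off-the-shelf translation system.

```python
from typing import List


def align_delays(logp: List[List[float]]) -> List[int]:
    """Assign each target word a delay, in number of source words.

    logp[i][j] is assumed to hold the log-probability of target word j
    when the translation model sees only the first i source words
    (i = 0 means no source context at all).
    """
    n_prefixes = len(logp)
    n_targets = len(logp[0])
    delays: List[int] = []
    latest = 0  # running maximum keeps the alignment monotonic
    for j in range(n_targets):
        best_i, best_gain = 1, float("-inf")
        for i in range(1, n_prefixes):
            # Gain in log-probability from revealing source word i.
            gain = logp[i][j] - logp[i - 1][j]
            if gain > best_gain:
                best_i, best_gain = i, gain
        latest = max(latest, best_i)
        delays.append(latest)
    return delays


# Toy scores (made up): 3 source words, 3 target words.
example = [
    [-5.0, -6.0, -7.0],  # no source words seen
    [-1.0, -6.0, -6.5],  # 1 source word seen
    [-0.9, -5.5, -1.0],  # 2 source words seen
    [-0.8, -0.5, -0.9],  # 3 source words seen
]
print(align_delays(example))  # [1, 3, 3]
```

In the toy table, the third target word becomes predictable after two source words, but it cannot be emitted before the second target word, which needs three; the running maximum captures that constraint.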

Subject: ICML.2025 - Poster

fgjN8B6xVX@OpenReview
