Translating speech with just images

oneata24@interspeech_2024@ISCA

Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yoruba, and propose a Yoruba-to-English speech translation model that leverages pretrained components in order to be able to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.

Subject: INTERSPEECH.2024 - Language and Multimodal
