azuh19@interspeech_2019@ISCA


Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio

Authors: Emmanuel Azuh ; David Harwath ; James Glass

In this paper, we present a method for the discovery of word-like units and their approximate translations from visually grounded speech across multiple languages. We first train a neural network model to map images and their spoken audio captions in both English and Hindi to a shared, multimodal embedding space. Next, we use this model to segment and cluster regions of the spoken captions which approximately correspond to words. Finally, we exploit between-cluster similarities in the embedding space to associate English pseudo-word clusters with Hindi pseudo-word clusters, and show that many of these cluster pairings capture semantic translations between English and Hindi words. We present quantitative cross-lingual clustering results, as well as qualitative results in the form of a bilingual picture dictionary.
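
The final step described above, associating English pseudo-word clusters with Hindi ones through between-cluster similarity in the shared embedding space, can be illustrated with a minimal sketch. The abstract does not state which similarity measure or pairing rule the authors use, so everything here is an assumption: clusters are summarized by their centroids, similarity is cosine similarity, and each English cluster is paired with its single most similar Hindi cluster. The names `cosine`, `centroid`, `pair_clusters`, and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pair_clusters(english, hindi):
    """Pair each English cluster with its most similar Hindi cluster.

    Both arguments map a cluster label to a list of embedding vectors
    (assumed to live in the same shared embedding space).
    Returns a dict: English label -> (Hindi label, cosine similarity).
    """
    hindi_centroids = {label: centroid(vecs) for label, vecs in hindi.items()}
    pairs = {}
    for en_label, en_vecs in english.items():
        en_centroid = centroid(en_vecs)
        best = max(
            hindi_centroids,
            key=lambda hi_label: cosine(en_centroid, hindi_centroids[hi_label]),
        )
        pairs[en_label] = (best, cosine(en_centroid, hindi_centroids[best]))
    return pairs


# Toy 2-D embeddings standing in for segment embeddings (illustrative only).
english = {
    "en_0": [[1.0, 0.1], [0.9, 0.0]],
    "en_1": [[0.0, 1.0], [0.1, 0.9]],
}
hindi = {
    "hi_0": [[0.1, 1.0], [0.0, 0.8]],
    "hi_1": [[1.0, 0.0], [0.8, 0.1]],
}

pairs = pair_clusters(english, hindi)
for en_label, (hi_label, score) in pairs.items():
    print(en_label, "->", hi_label, round(score, 3))
```

A nearest-centroid rule like this one lets several English clusters map to the same Hindi cluster; a one-to-one assignment or a similarity threshold would be alternative design choices, and the abstract does not indicate which, if any, the authors adopt.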