2024.iwslt-1.18@ACL

#1 Speech Data from Radio Broadcasts for Low Resource Languages

Authors: Bismarck Bamfo Odoom ; Leibny Paola Garcia Perera ; Prangthip Hansanti ; Loic Barrault ; Christophe Ropers ; Matthew Wiesner ; Kenton Murray ; Alexandre Mourachko ; Philipp Koehn

We created a collection of speech data for 48 low-resource languages. The corpus is extracted from radio broadcasts and processed with novel speech detection and language identification models based on a manually vetted subset of the audio for 10 languages. The data is made publicly available.