2024.iwslt-1.18@ACL


#1 Speech Data from Radio Broadcasts for Low Resource Languages

Authors: Bismarck Bamfo Odoom, Leibny Paola Garcia Perera, Prangthip Hansanti, Loic Barrault, Christophe Ropers, Matthew Wiesner, Kenton Murray, Alexandre Mourachko, Philipp Koehn

We created a collection of speech data for 48 low-resource languages. The corpus is extracted from radio broadcasts and processed with novel speech detection and language identification models based on a manually vetted subset of the audio for 10 languages. The data is made publicly available.

Subject: IWSLT.2024