Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati

2025.findings-emnlp.1224@ACL

Total: 1

#1 Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati [PDF] [Copy] [Kimi] [REL]

Authors: Sanjay Booshanam, Kelly Chen, Ondrej Klejch, Thomas Reitmaier, Dani Kalarikalayil Raju, Electra Wallington, Nina Markl, Jennifer Pearson, Matt Jones, Simon Robinson, Peter Bell

Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system that we can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati – an unwritten language – that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving the hope that it may be useable by unwritten language speakers worldwide.

Subject: EMNLP.2025 - Findings