SloParl - Slovenian parliamentary speech and text corpus for large vocabulary continuous speech recognition

Authors: Andrej Zgank, Tomas Rotovnik, Matej Grasic, Marko Kos, Damjan Vlaj, Zdravko Kacic

This paper presents a novel Slovenian language resource, the SloParl database. It consists of debates recorded in the Slovenian Parliament. The main goal of the project was to cost-effectively collect a new Slovenian language resource that could be used to augment the available Slovenian speech corpora for developing a large vocabulary continuous speech recognition system. The SloParl speech corpus has a total length of 100 hours and comprises selected sessions from the years 2000-2005. This speech corpus will be used for lightly supervised or unsupervised training of acoustic models, and the accompanying transcriptions were prepared accordingly. The second part of the SloParl database is the text corpus, which covers the text of all debates from the period 1996-2005 and consists of 23M words. It will be used to create different types of language models for speech recognisers. A comparison with other Slovenian language resources showed that the SloParl database adds new aspects to the modelling of the Slovenian language.

Subject: INTERSPEECH.2006 - Analysis and Assessment

zgank06@interspeech_2006@ISCA