xu12@interspeech_2012@ISCA

#1 Phrasal cohort based unsupervised discriminative language modeling

Authors: Puyang Xu ; Brian Roark ; Sanjeev Khudanpur

Simulated confusions enable the use of large text-only corpora for discriminative language modeling by hallucinating the likely recognition outputs that each (correct) sentence would be confused with. In [1], a novel approach was introduced to simulate confusions using phrasal cohorts derived directly from recognition output. However, the described approach relied on transcribed speech to derive cohorts. In this paper, we extend the phrasal cohort technique to the fully unsupervised scenario, where transcribed data are completely absent. Experimental results show that even if the cohorts are extracted from untranscribed speech, the unsupervised training can still achieve over 40% of the gains of the supervised approach. The results are presented on NIST data sets for a state-of-the-art LVCSR system.