INTERSPEECH.2007 - Speech Recognition

Total: 118

#1 Noise-robust hands-free voice activity detection with adaptive zero crossing detection using talker direction estimation

Authors: Yuki Denda, Takamasa Tanaka, Masato Nakayama, Takanobu Nishiura, Yoichi Yamashita

This paper proposes a novel hands-free voice activity detection (VAD) method, called adaptive zero crossing detection (AZCD), that uses talker direction estimation to exploit spatial as well as temporal features. It first estimates the talker direction to extract two spatial features, spatial reliability and spatial variance, based on weighted cross-power spectrum phase analysis and maximum likelihood estimation. The AZCD then detects voice activity frames in noisy environments by robustly detecting the zero crossing information of speech, with thresholds adaptively controlled by the extracted spatial features. Experimental results in an actual office room confirmed that the VAD performance of the proposed method, which utilizes both temporal and spatial features, is superior to that of conventional methods that utilize only temporal or only spatial features.
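
As a rough illustration of thresholded zero crossing detection, the sketch below counts sign changes only among samples whose magnitude exceeds a threshold. It is not the authors' AZCD: the abstract does not say how the spatial features set the threshold, so `threshold` and `min_crossings` are plain, hypothetical parameters here.

```python
def zero_crossings(frame, threshold):
    """Count sign changes between samples whose magnitude exceeds `threshold`."""
    count = 0
    prev_sign = 0
    for sample in frame:
        if abs(sample) <= threshold:
            continue  # low-level samples are treated as noise and skipped
        sign = 1 if sample > 0 else -1
        if prev_sign != 0 and sign != prev_sign:
            count += 1
        prev_sign = sign
    return count


def detect_voice(frames, threshold, min_crossings):
    """Flag a frame as active when it has enough thresholded zero crossings."""
    return [zero_crossings(f, threshold) >= min_crossings for f in frames]
```

In an adaptive scheme, `threshold` would be recomputed per frame from other evidence (here, presumably the spatial features) instead of being fixed.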


#2 A robust mel-scale subband voice activity detector for a car platform

Authors: A. Álvarez, R. Martínez, P. Gómez, V. Nieto, V. Rodellar

Voice-controlled devices provide a smart solution for operating add-on appliances in a car. Although speech recognition appears to be a key technology for producing useful end-user interfaces, the amount of acoustic disturbance in automotive platforms usually prevents satisfactory results. In most cases, noise reduction techniques involving a Voice Activity Detector (VAD) are required. This paper proposes a robust method for speech detection under the influence of noise and reverberation in an automobile environment. The method achieves consistent speech/non-speech discrimination by means of a set of Order-Statistics Filters (OSFs) applied to the log-energies associated with a mel-scale based subband division. The paper also includes an extensive performance evaluation of the algorithm using AURORA3 database recordings. According to our simulation results, the proposed algorithm performs significantly better on average than standard VADs such as ITU-G.729B, GSM-AMR or ETSI-AFE, and other recently reported methods.
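
For readers unfamiliar with order-statistics filters, a minimal sketch follows: each output is a chosen quantile of the values in a sliding window, for example over the log-energy track of one subband. The window size, the quantile, and the function name are assumptions for illustration; the abstract does not give the paper's settings.

```python
def order_statistic_filter(values, half_window, quantile):
    """Replace each value by the given quantile of its neighbourhood.

    quantile=0.5 approximates a median filter, 0.0 a min filter, 1.0 a max filter.
    """
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = sorted(values[lo:hi])
        idx = int(quantile * (len(window) - 1))  # floor to a valid index
        out.append(window[idx])
    return out
```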


#3 Noise robust front-end processing with voice activity detection based on periodic to aperiodic component ratio

Authors: Kentaro Ishizuka, Tomohiro Nakatani, Masakiyo Fujimoto, Noboru Miyazaki

This paper proposes a front-end processing method for automatic speech recognition (ASR) that employs a voice activity detection (VAD) method based on the periodic to aperiodic component ratio (PAR). The proposed VAD method is called PARADE (PAR based Activity DEtection). By considering the powers of the periodic and aperiodic components of the observed signals simultaneously, PARADE can detect speech segments more precisely in the presence of noise than conventional VAD methods. In this paper, PARADE is applied to a front-end processing technique that employs a robust feature extraction method called SPADE (Subband based Periodicity and Aperiodicity DEcomposition). The noisy ASR performance was examined with the CENSREC-1-C database, which includes connected continuous digit speech utterances drawn from CENSREC-1 (Japanese version of AURORA-2). The result shows that the SPADE front-end combined with PARADE achieves average word accuracy of 74.22% at signal to noise ratios of 0 to 20 dB. This accuracy is significantly higher than that achieved by the ETSI ES 202 050 front-end (63.66%) and the SPADE front-end without PARADE (64.28%). This result also confirmed that PARADE can improve the performance of front-end processing.


#4 Feature and distribution normalization schemes for statistical mismatch reduction in reverberant speech recognition

Authors: A. M. Toh, Roberto Togneri, Sven Nordholm

Reverberant noise has been a major concern in speech recognition systems. Many speech recognition systems, even with state-of-the-art features, fail to cope with reverberant effects, and the recognition rate deteriorates. This paper explores the significance of normalization strategies in reducing statistical mismatches for robust speech recognition in reverberant environments. Most normalization work has focused only on ambient noise and has yet to be tested on reverberant noise. In addition, we propose a new approach for the odd order cepstral moment normalization which is computationally more efficient and reduces the convergence rate in the algorithm. The proposed method is experimentally justified and corroborated by the performance of other normalization schemes. The results emphasize the significance of reducing statistical mismatches in feature space for reverberant speech recognition.


#5 Temporal masking for unsupervised minimum Bayes risk speaker adaptation

Authors: Matthew Gibson, Thomas Hain

The minimum Bayes risk (MBR) criterion has previously been applied to the task of speaker adaptation in large vocabulary continuous speech recognition. The success of unsupervised MBR speaker adaptation, however, has been limited by the accuracy of the estimated transcription of the acoustic data. This paper addresses this issue not by improving the accuracy of the estimated transcription but via temporal masking of its erroneous regions.


#6 Speech feature compensation based on pseudo stereo codebooks for robust speech recognition in additive noise environments

Authors: Tsung-hsueh Hsieh, Jeih-weih Hung

In this paper, we propose several compensation approaches to alleviate the effect of additive noise on speech features for speech recognition. These approaches are simple yet efficient noise reduction techniques that use online constructed pseudo stereo codebooks to evaluate the statistics in both clean and noisy environments. The process yields transforms for noise-corrupted speech features to make them closer to their clean counterparts. We apply these compensation approaches to various well-known speech features, including mel-frequency cepstral coefficients (MFCC), autocorrelation mel-frequency cepstral coefficients (AMFCC) and perceptual linear prediction cepstral coefficients (PLPCC). Experiments conducted on the Aurora-2 database show that the proposed approaches provide all types of features with a significant performance gain when compared to the baseline results and those obtained by using the conventional utterance-based cepstral mean and variance normalization (CMVN).
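
The CMVN baseline mentioned above can be sketched in a few lines: every feature dimension is shifted and scaled to zero mean and unit variance over one utterance. This shows only the comparison baseline, not the proposed codebook-based compensation, and the function name is an assumption.

```python
import math


def cmvn(features):
    """Utterance-level cepstral mean and variance normalization.

    `features` is a list of frames, each a list of coefficients.
    """
    n_frames = len(features)
    n_dims = len(features[0])
    means = [sum(f[d] for f in features) / n_frames for d in range(n_dims)]
    stds = []
    for d in range(n_dims):
        var = sum((f[d] - means[d]) ** 2 for f in features) / n_frames
        # A constant dimension has zero variance; leave its scale untouched.
        stds.append(math.sqrt(var) if var > 0 else 1.0)
    return [[(f[d] - means[d]) / stds[d] for d in range(n_dims)] for f in features]
```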


#7 Multiband, multisensor robust features for noisy speech recognition

Authors: Dimitrios Dimitriadis, Petros Maragos, Stamatios Lefkimmiatis

This paper presents a novel feature extraction scheme that takes advantage of both the nonlinear modulation speech model and the spatial diversity of speech and noise signals in a multisensor environment. We propose applying robust features to speech signals captured by a multisensor array, minimizing a noise energy criterion over multiple frequency bands. We show that improved recognition performance can be achieved by minimizing the Teager-Kaiser energy of the noise-corrupted signals in different frequency bands. These Multiband, Multisensor Cepstral (MBSC) features are inspired by similar ones that have already been applied to single-microphone noisy speech recognition tasks with significantly improved results. The recognition results show that the proposed features can perform better than the widely used MFCC features.
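
The discrete Teager-Kaiser energy operator referred to above is commonly written as Ψ[x(n)] = x(n)² − x(n−1)·x(n+1). The sketch below computes it for a single signal; the multiband, multisensor minimization described in the abstract is not reproduced.

```python
def teager_kaiser_energy(x):
    """Discrete Teager-Kaiser energy for the interior samples of `x`."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]
```

For a pure sinusoid A·cos(Ωn + φ) the operator returns the constant A²·sin²(Ω), which is why it is used as a joint measure of amplitude and frequency.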


#8 Noise robust speech recognition for voice driven wheelchair

Authors: Akira Sasou, Hiroaki Kojima

In this paper, we introduce a noise robust speech recognition system for a voice-driven wheelchair. Our system uses a microphone array so that the user does not need to wear a microphone. Because the microphone array is mounted on the wheelchair, the system can easily distinguish the user's utterances from other voices without using a speaker identification technique. We have also adopted a feature compensation technique. By combining the microphone array and the feature compensation technique, our system can be applied to various noise environments. This is because the microphone array can provide reliable voice activity detection information to the feature compensation method, and the feature compensation method can compensate for the weak point of the microphone array, namely that it tends to be less effective against omni-directional noise.


#9 Irrelevant variability normalization based HMM training using VTS approximation of an explicit model of environmental distortions

Authors: Yu Hu, Qiang Huo

In a traditional HMM compensation approach to robust speech recognition that uses Vector Taylor Series (VTS) approximation of an explicit model of environmental distortions, the set of generic HMMs is typically trained from "clean" speech only. In this paper, we present a maximum likelihood approach to training generic HMMs from both "clean" and "corrupted" speech based on the concept of irrelevant variability normalization. Evaluation results on the Aurora2 connected digits database demonstrate that the proposed approach achieves significant improvements in recognition accuracy compared to the traditional VTS-based HMM compensation approach.


#10 On the jointly unsupervised feature vector normalization and acoustic model compensation for robust speech recognition

Authors: Luis Buera, Antonio Miguel, Eduardo Lleida, Óscar Saz, Alfonso Ortega

To compensate for the mismatch between training and testing conditions, an unsupervised hybrid compensation technique is proposed. It combines Multi-Environment Model based LInear Normalization (MEMLIN) with a novel acoustic model adaptation method based on rotation transformations. A set of rotation transformations is estimated between clean and MEMLIN-normalized data by linear regression in a training process. Each MEMLIN-normalized frame is then decoded using the expanded acoustic models, which are obtained from the reference models and the set of rotation transformations. During the search, one of the rotation transformations is selected online for each frame according to the ML criterion in a modified Viterbi algorithm. Experiments were carried out with the Spanish SpeechDat Car database. MEMLIN over standard ETSI front-end parameters reaches a mean improvement in WER of 75.53%, while the introduced hybrid solution reaches 90.54%.


#11 An ensemble modeling approach to joint characterization of speaker and speaking environments

Authors: Yu Tsao, Chin-Hui Lee

We propose an ensemble modeling framework to jointly characterize speaker and speaking environments for robust speech recognition. We represent a particular environment by a super-vector formed by concatenating the entire set of mean vectors of the Gaussian mixture components in its corresponding hidden Markov model set. In the training phase we generate an ensemble speaker and speaking environment super-vector by concatenating all the super-vectors trained on data from many real or simulated environments. In the recognition phase the ensemble speaker and speaking environment super-vector is converted to the super-vector for the testing environment with an affine transformation that is estimated online with a maximum likelihood (ML) algorithm. We used a simplified formulation for the proposed approach and evaluated its performance on the Aurora 2 database. In an unsupervised adaptation mode, the proposed approach achieves 7.27% and 13.68% WER reductions, respectively, when tested in clean and averaged noisy conditions (from 0dB to 20dB) over the baseline performance on a gender dependent system. The results suggest that the proposed approach can well characterize environments under the presence of either single or multiple distortion sources.


#12 Cluster-based polynomial-fit histogram equalization (CPHEQ) for robust speech recognition

Authors: Shih-Hsiang Lin, Yao-Ming Yeh, Berlin Chen

Noise robustness is one of the primary challenges facing most automatic speech recognition (ASR) systems. A vast amount of research effort has gone into preventing the degradation of ASR performance under various noisy environments during the past several years. In this paper, we consider the use of histogram equalization (HEQ) for robust ASR. In contrast to conventional methods, a novel data fitting method based on polynomial regression is presented to efficiently approximate the inverse of the cumulative density functions of speech feature vectors for HEQ. Moreover, a more elaborate attempt to use such polynomial regression models to directly characterize the relationship between the speech feature vectors and their corresponding probability distributions, under various noise conditions, is proposed as well. All experiments were carried out on the Aurora-2 database and task. The performance of the presented methods was extensively tested and verified by comparison with other methods. Experimental results show that for clean-condition training, our method achieved a considerable word error rate reduction over the baseline system, and also significantly outperformed the other methods.
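
To make the HEQ idea concrete, here is a minimal rank-based sketch for one feature dimension: each value is mapped through its empirical CDF and then through the inverse CDF of a reference distribution. The standard normal reference and the rank-based CDF estimate are assumptions chosen for brevity; the paper instead approximates the inverse CDF with polynomial regression, which this sketch does not do.

```python
from statistics import NormalDist


def histogram_equalize(values):
    """Map one feature dimension onto a standard normal reference distribution."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    reference = NormalDist()  # assumed reference: zero mean, unit variance
    out = [0.0] * n
    for rank, i in enumerate(order):
        cdf = (rank + 0.5) / n  # empirical CDF, kept strictly inside (0, 1)
        out[i] = reference.inv_cdf(cdf)
    return out
```

The mapping is monotonic, so the ordering of the input values is preserved while their distribution is reshaped.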


#13 Robust distributed speech recognition using histogram equalization and correlation information

Authors: Pedro M. Martinez, Jose C. Segura, Luz Garcia

In this paper, we propose a noise compensation method for robust speech recognition in DSR (Distributed Speech Recognition) systems based on histogram equalization and correlation information. The objective of this method is to exploit the correlation between components of the feature vector and the temporal correlation between consecutive frames of each component. The recognition experiments, including results in the Aurora 2, Aurora 3-Spanish and Aurora 3-Italian databases, demonstrate that the use of this correlation information increases the recognition accuracy.


#14 Predictive minimum Bayes risk classification for robust speech recognition

Authors: Jen-Tzung Chien, Koichi Shinoda, Sadaoki Furui

This paper presents a new Bayes classification rule that minimizes the predictive Bayes risk for robust speech recognition. Conventionally, the plug-in maximum a posteriori (MAP) classification is constructed by adopting a nonparametric loss function and deterministic model parameters. Speech recognition performance is limited by the environmental mismatch and the ill-posed model. To address these issues, we develop the predictive minimum Bayes risk (PMBR) classification, where the predictive distributions are inherent in the Bayes risk. More specifically, we exploit the Bayes loss function and the predictive word posterior probability for Bayes classification. Model mismatch and randomness are compensated to improve generalization capability in speech recognition. In the experiments on car speech recognition, we estimate the prior densities of hidden Markov model parameters from adaptation data. With prior knowledge of the new environment and model uncertainty, PMBR classification is realized and shown to be better than MAP, MBR and Bayesian predictive classification.


#15 Applying word duration constraints by using unrolled HMMs

Authors: Ning Ma, Jon Barker, Phil Green

Conventional HMMs have weak duration constraints. In noisy conditions, the mismatch between corrupted speech signals and models trained on clean speech may cause the decoder to produce word matches with unrealistic durations. This paper presents a simple way to incorporate word duration constraints by unrolling HMMs to form a lattice where word duration probabilities can be applied directly to state transitions. The expanded HMMs are compatible with conventional Viterbi decoding. Experiments on connected-digit recognition show that when using explicit duration constraints the decoder generates word matches with more reasonable durations, and word error rates are significantly reduced across a broad range of noise conditions.


#16 Evaluating the temporal structure normalisation technique on the Aurora-4 task

Authors: Xiong Xiao, Eng Siong Chng, Haizhou Li

We evaluate the temporal structure normalisation (TSN), a feature normalisation technique for robust speech recognition, on the large vocabulary Aurora-4 task. The TSN technique operates by normalising the trend of the feature's power spectral density (PSD) function to a reference function using finite impulse response (FIR) filters. The features are the cepstral coefficients and the normalisation procedure is performed on every cepstral channel of each utterance. Experimental results show that the TSN reduces the average word error rate (WER) by 7.20% and 8.16% relative to the mean-variance normalisation (MVN) and the histogram equalisation (HEQ) baselines, respectively. We further evaluate two other state-of-the-art temporal filters. Experimental results show that among the three evaluated temporal filters, the TSN filter performs the best. Lastly, our results also demonstrate that fixed smoothing filters are less effective on the Aurora-4 task than on the Aurora-2 task.


#17 Two-stage system for robust neutral/lombard speech recognition

Authors: Hynek Bořil, Petr Fousek, Harald Höge

The performance of current speech recognition systems deteriorates significantly in strongly noisy environments. This can be attributed to background noise and the Lombard effect (LE). Attempts at LE-robust systems often display a tradeoff between LE-specific improvements and portability to neutral speech. Therefore, for LE-robust recognition, it seems effective to use a set of condition-dedicated subsystems driven by a condition classifier, rather than attempting to build one universal recognizer.


#18 Noise suppression using search strategy with multi-model compositions

Authors: Takatoshi Jitsuhiro, Tomoji Toriyama, Kiyoshi Kogure

We introduce a new noise suppression method by using a search strategy with multi-model compositions that includes the following models: speech, noise, and their composites. Before noise suppression, a beam search is performed to find the best sequences of these models using noise acoustic models, noise-label n-gram models, and a noise-label lexicon. Noise suppression is frame-synchronously performed by the multiple models selected by the search. We evaluated this method using the E-Nightingale task, which contains voice memoranda spoken by nurses during actual work at hospitals. For this difficult task, the proposed method obtained a 21.6% error reduction rate.


#19 Investigations into early and late reflections on distant-talking speech recognition toward suitable reverberation criteria

Authors: Takanobu Nishiura, Yoshiki Hirano, Yuki Denda, Masato Nakayama

Reverberation-robust speech recognition has become very important in the recognition of distant-talking speech. However, as no common reverberation criteria for the recognition of reverberant speech have been proposed, it has been difficult to estimate this. We have thus focused on a reverberation criterion for the recognition of distant-talking speech. Reverberation time is currently the generally used reverberation criterion for the recognition of distant-talking speech. This is unique and does not depend on the position of the source in a room. However, distant-talking speech recognition greatly depends on the location of the talker relative to that of the microphone and the distance between them. To overcome this problem, we investigated a suitable reverberation criterion for distant-talking speech recognition using the ISO3382 acoustic parameters. We first evaluated distant-talking speech recognition with early and late reflections based on the impulse response between the talker and microphone. As a result, we found that early reflections within about 12.5 ms from the duration of direct sound contributed slightly to distant-talking speech recognition in non-noisy environments. We then evaluated it based on ISO3382 acoustic parameters. We consequently confirmed that the ISO3382 acoustic parameters are strong candidates for the new reverberation criteria for distant-talking speech recognition.
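
One family of ISO 3382 parameters is the early-to-late energy ratio (clarity, for example C50 with a 50 ms boundary). The sketch below computes such a ratio from a sampled impulse response. It assumes the response starts at the arrival of the direct sound, and the abstract does not say which ISO 3382 parameters the authors used, so treat this purely as an illustration.

```python
import math


def clarity_db(impulse_response, sample_rate, boundary_ms=50.0):
    """Early-to-late energy ratio in dB, split at `boundary_ms` after the start."""
    split = int(round(sample_rate * boundary_ms / 1000.0))
    early = sum(h * h for h in impulse_response[:split])
    late = sum(h * h for h in impulse_response[split:])
    if early <= 0.0 or late <= 0.0:
        raise ValueError("both early and late parts need non-zero energy")
    return 10.0 * math.log10(early / late)
```

Unlike reverberation time, this ratio changes with the talker and microphone positions because it is computed from the specific impulse response between them.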


#20 An approach to iterative speech feature enhancement and recognition

Authors: Stefan Windmann, Reinhold Haeb-Umbach

In this paper we propose a novel iterative speech feature enhancement and recognition architecture for noisy speech recognition. It consists of model-based feature enhancement employing Switching Linear Dynamical Models (SLDM), a hidden Markov Model (HMM) decoder and a state mapper, which maps HMM to SLDM states. To consistently adhere to a Bayesian paradigm, posteriors are exchanged between these processing blocks. By introducing feedback from the recognizer to the enhancement stage, enhancement can exploit both the SLDM's ability to model short-term dependencies and the HMM's ability to model long-term dependencies present in the speech data. Experiments conducted on the Aurora II database demonstrate that significant word accuracy improvements are obtained at low signal-to-noise ratios.


#21 Optimization of temporal filters in the modulation frequency domain for constructing robust features in speech recognition

Author: Jeih-weih Hung

In this paper, we derive new data-driven temporal filters that employ the statistics of the modulation spectra of the speech features. The new temporal filtering approaches are based on the constrained versions of Principal Component Analysis (C-PCA) and Maximum Class Distance (C-MCD), respectively. It is shown that the proposed C-PCA and C-MCD temporal filters can effectively improve the speech recognition accuracy in various noise-corrupted environments. In experiments conducted on Test Set A of the Aurora-2 noisy digits database, these new temporal filters, together with cepstral mean and variance normalization (CMVN), provide average relative error reduction rates of over 40% and 27% when compared with the baseline MFCC processing and CMVN alone, respectively.


#22 The harming part of room acoustics in automatic speech recognition

Authors: Rico Petrick, Kevin Lohde, Matthias Wolff, Rüdiger Hoffmann

Automatic speech recognition (ASR) systems used in real indoor scenarios suffer from noise and reverberation conditions that differ from the training conditions. This article describes a study that aims to find out which parts of reverberation are most harmful to speech recognition. Noise influences are left out. To this end, real room impulse responses are measured in different rooms and at different speaker-to-microphone distances, and then modified. The results of the recognition experiments with the correspondingly convolved impulse responses clearly show the dependency on early and late as well as high and low frequency reflections. Conclusions concerning the design of a dereverberation method are drawn.


#23 A reference model weighting-based method for robust speech recognition

Authors: Yuan Fu Liao, Yh-Her Yang, Chi-Hui Hsu, Cheng-Chang Lee, Jing-Teng Zeng

In this paper, a reference model weighting (RMW) method is proposed for fast hidden Markov model (HMM) adaptation, which aims to use only one input test utterance to estimate online the characteristics of the unknown noisy test environment. The idea of RMW is to first collect a set of reference HMMs in the training phase to represent the space of noisy environments, and then synthesize a suitable HMM for the unknown noisy test environment by interpolating the set of reference HMMs. Noisy environment mismatch can hence be efficiently compensated. The proposed method was evaluated on the multi-condition training task of the Aurora2 corpus. Experimental results showed that the proposed RMW approach outperformed both the histogram equalization (HEQ) method and the distributed speech recognition (DSR) standard ES 202 212 proposed by the European Telecommunications Standards Institute (ETSI).
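
The interpolation step can be pictured as a weighted combination of corresponding parameters across the reference models. The sketch below interpolates one Gaussian mean vector; how the weights are estimated from the test utterance is not described in the abstract, so they are simply passed in, and the function name is hypothetical.

```python
def interpolate_means(reference_means, weights):
    """Weighted average of corresponding mean vectors from the reference models."""
    if len(reference_means) != len(weights):
        raise ValueError("need exactly one weight per reference model")
    total = sum(weights)
    n_dims = len(reference_means[0])
    return [
        sum(w * m[d] for w, m in zip(weights, reference_means)) / total
        for d in range(n_dims)
    ]
```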


#24 Mel sub-band filtering and compression for robust speech recognition

Authors: Babak Nasersharif, Ahmad Akbari, Mohammad Mehdi Homayounpour

Mel-frequency cepstral coefficients (MFCC) are commonly used in speech recognition systems. However, they are highly sensitive to the presence of external noise. In this paper, we propose a noise compensation method for Mel filter bank energies and hence for MFCC features. The compensation is performed in two stages: Mel sub-band filtering, followed by compression of the Mel sub-band energies. For the compression step, we propose a sub-band SNR-dependent compression function, which replaces the logarithm function of conventional MFCC feature extraction in the presence of additive noise. Results show that the proposed method significantly improves the performance of MFCC features in noisy conditions, decreasing the average word error rate by up to 30% for isolated word recognition on three test sets of the Aurora 2 database.


#25 Clustered maximum likelihood linear basis for rapid speaker adaptation

Authors: Yun Tang, Richard Rose

Speaker space based adaptation methods for automatic speech recognition have been shown to provide significant performance improvements for tasks where only a few seconds of adaptation speech is available. This paper proposes a robust, low complexity technique within this general class that has been shown to reduce word error rate, reduce the large storage requirements associated with speaker space approaches, and eliminate the need for large numbers of utterances per speaker in training. The technique is based on representing speakers as a linear combination of clustered linear basis vectors, and a procedure is presented for ML estimation of these vectors from training data. Significant word error rate reduction was obtained relative to speaker independent performance for the Resource Management and Wall Street Journal task domains.