We propose a low-complexity unit-selection algorithm for ultra-low-bit-rate speech coding, based on a first-stage N-best prequantization lattice and a second-stage run-length-constrained Viterbi search, to efficiently approximate the complete search space of the fully optimal unit-selection algorithm we recently proposed. As a result, the proposed algorithm remains near-optimal in rate-distortion performance while requiring far less computation.
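The abstract does not spell out the search mechanics, but a generic run-length-constrained Viterbi search is easy to sketch. The version below assumes (hypothetically) that the constraint caps how many consecutive frames may reuse the same unit; the distortion matrix and switch cost are placeholders, not the paper's actual cost terms.

```python
import numpy as np

def run_length_viterbi(dist, switch_cost=1.0, max_run=4):
    """Minimum-cost unit sequence with runs capped at max_run frames.

    dist        : (T, U) array, dist[t, u] = distortion of unit u at frame t
                  (placeholder for the paper's actual cost; assumes U >= 2).
    switch_cost : extra cost charged whenever the selected unit changes.
    """
    T, U = dist.shape
    INF = np.inf
    # cost[u, r] = best path cost ending at the current frame in unit u
    # with a run of length r + 1 (r = 0 .. max_run - 1).
    cost = np.full((U, max_run), INF)
    cost[:, 0] = dist[0]
    back = [None]                              # back[t][u, r] = (prev_u, prev_r)
    for t in range(1, T):
        new = np.full((U, max_run), INF)
        ptr = np.zeros((U, max_run, 2), dtype=int)
        per_unit = cost.min(axis=1)            # best cost per unit, any run length
        per_unit_r = cost.argmin(axis=1)
        order = np.argsort(per_unit)           # to locate the best *other* unit
        for u in range(U):
            # Case 1: switch into unit u; the run restarts at length 1.
            v = order[0] if order[0] != u else order[1]
            new[u, 0] = per_unit[v] + switch_cost + dist[t, u]
            ptr[u, 0] = (v, per_unit_r[v])
            # Case 2: extend the run in unit u while the cap allows it.
            for r in range(1, max_run):
                if cost[u, r - 1] < INF:
                    new[u, r] = cost[u, r - 1] + dist[t, u]
                    ptr[u, r] = (u, r - 1)
        cost = new
        back.append(ptr)
    u, r = np.unravel_index(np.argmin(cost), cost.shape)
    path = [u]
    for t in range(T - 1, 0, -1):              # backtrack
        u, r = back[t][u, r]
        path.append(u)
    return [int(p) for p in path[::-1]]

# Toy usage: 10 frames, 3 candidate units, random distortions.
print(run_length_viterbi(np.random.default_rng(0).random((10, 3))))
```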
This paper introduces a very fast algebraic fixed-codebook search algorithm for CELP-based speech codecs. The proposed method searches codebook pulses sequentially and recomputes the fixed-codebook gain, the so-called backward-filtered target vector, and a certain reference signal after each new pulse is determined. This results in a significant complexity reduction compared to existing methods while preserving the same speech quality. The presented algorithm is used in the new embedded speech and audio codec (G.EV-VBR) currently being standardized by the ITU-T.
In this paper, a novel transcoding algorithm for codebook gain conversion between AMR-NB at 7.95 kb/s and G.729a is proposed. It bypasses the gain prediction process and converts the codebook gain parameters directly. Additionally, the new gain parameter conversion method can be extended to the other rate modes of AMR-NB when transcoding with G.729a. Experimental results show that the quality of the transcoded speech is greatly improved and the computational complexity is reduced by 85% compared with the DTE (Decode-then-Encode) method. A 5 ms look-ahead delay is avoided as well.
We present a novel MFCC-based scheme for the bandwidth extension (BWE) of narrowband speech. BWE is based on the assumption that the narrowband signal (0.3-3.4 kHz) correlates closely with the highband signal (3.4-7 kHz), enabling estimation of the highband frequency content given the narrowband. While BWE schemes have traditionally used LP-based parametrizations, our recent work has shown that MFCC parametrization yields higher correlation between the two bands, reaching twice that obtained with LSFs. By applying a high-resolution IDCT to highband MFCCs obtained from narrowband MFCCs by statistical estimation, we achieve high-quality highband power spectra from which the time-domain speech signal can be reconstructed. Implementing this scheme for BWE translates the higher correlation of MFCCs into BWE performance superior to that obtained using LSFs, as shown by improvements in log-spectral distortion as well as Itakura-based measures (the latter improving by up to 13%).
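The IDCT step can be illustrated compactly. The sketch below recovers a smooth power envelope from a truncated MFCC-like vector by zero-padding the cepstral coefficients inside a high-resolution inverse DCT; the mel-to-linear frequency unwarping and the statistical narrowband-to-highband estimator are omitted, and all sizes are assumptions.

```python
import numpy as np
from scipy.fftpack import dct, idct

def mfcc_to_power_envelope(coeffs, n_out=257):
    """Recover a smooth power envelope from truncated cepstral coefficients.

    coeffs : low-order DCT coefficients of a log power spectrum (the mel
             warping and filterbank of true MFCCs are omitted here).
    n_out  : resolution of the reconstruction; the IDCT zero-pads the
             cepstrum to this length, giving the 'high-resolution' inverse.
    """
    log_spec = idct(coeffs, n=n_out, norm='ortho')  # high-resolution IDCT
    return np.exp(log_spec)                         # log power -> power

# Toy round trip: keep 13 coefficients of a 257-point log spectrum.
log_power = np.log(np.abs(np.random.randn(257)) + 1.0)
c13 = dct(log_power, norm='ortho')[:13]             # MFCC-like truncation
envelope = mfcc_to_power_envelope(c13, n_out=257)   # smoothed envelope
```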
The ITU-T G.711.1 embedded wideband speech codec was approved by the ITU-T in March 2008. This codec generates a bitstream comprising three layers: a G.711-compatible core layer with noise shaping, a lower-band enhancement layer, and an MDCT-based higher-band enhancement layer. It also contains an optional postprocessing module, called Appendix I, designed to improve the quality of the decoded speech when interoperating with a legacy G.711 encoder. The improvement is achieved by a novel low-complexity PCM quantization noise reduction technique described in this article. Subjective test results show that the quality of the interoperability mode with the legacy G.711 codec is significantly better when Appendix I is activated.
In this contribution, a new instrumental measure for end-to-end speech transmission quality is presented which is based on perceptually relevant dimensions. The paper describes the complete scientific development process of such a measure, starting from the general framework and concluding with the concrete realization. The measure is based on the dimensions "discontinuity", "noisiness", and "coloration", which were identified through multidimensional analyses. Three dimension estimators are introduced which are capable of predicting so-called dimension impairment factors on the basis of signal parameters. Each dimension impairment factor reflects the degradation with respect to a single perceptual dimension. By combining the impairment factors, integral quality can be estimated. A maximum correlation of r = 0.9 with auditory test results is achieved for a wide range of perceptually different conditions.
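The combination rule is not given in the abstract; a minimal sketch, assuming a simple additive impairment model in the spirit of the E-model, with made-up coefficients:

```python
def integral_quality(i_disc, i_noise, i_color,
                     base=4.5, w=(1.0, 1.0, 1.0)):
    """Combine three dimension impairment factors into one quality score.

    The additive form, base value, and unit weights are illustrative
    assumptions; the paper derives its estimators and combination from
    auditory data.
    """
    q = base - w[0] * i_disc - w[1] * i_noise - w[2] * i_color
    return max(1.0, min(5.0, q))        # clip to the MOS scale

print(integral_quality(0.3, 0.8, 0.2))  # -> 3.2
```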
We propose an improved time-domain blind source separation method and apply it to speech signal enhancement using multiple microphone recordings. The improvement consists in using fuzzy clustering instead of hard clustering, which is verified by experiments in which real-world mixtures of two audio signals are separated from two microphones. The performance of the method is demonstrated by recognizing mixed and separated utterances from the Czech part of the European broadcast news database using our Czech LVCSR system. The separation enables significantly better recognition, e.g., by 32% when the jammer signal is Gaussian noise and the input signal-to-noise ratio is 10 dB.
ICA (Independent Component Analysis) can estimate unknown source signals from their mixtures under the assumption that the source signals are statistically independent. However, in a real environment, the separation performance often deteriorates because the number of source signals differs from the number of sensors. In this paper, we propose a method for estimating the number of sources based on the joint distribution of the observed signals under a two-sensor configuration. Several simulation results show that the number of sources coincides with the number of peaks in the histogram of that distribution. The proposed method can estimate the number of sources even when it exceeds the number of observed signals.
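The abstract leaves the histogram construction open. One common realization for sparse sources, sketched below, histograms the instantaneous mixing direction atan2(x2, x1) (folded to half a turn) and counts its local maxima; the direction feature and all thresholds are our assumptions, not necessarily the paper's.

```python
import numpy as np
from scipy.signal import find_peaks

def estimate_num_sources(x1, x2, bins=180):
    """Count peaks in the histogram of the observed mixing direction.

    Assumes sparse sources, so most high-energy samples are dominated by
    one source and cluster around that source's mixing direction.
    """
    r = np.hypot(x1, x2)
    keep = r > 0.1 * r.max()                    # drop near-silent samples
    theta = np.mod(np.arctan2(x2[keep], x1[keep]), np.pi)  # fold antipodes
    hist, _ = np.histogram(theta, bins=bins, range=(0.0, np.pi))
    peaks, _ = find_peaks(hist, height=0.05 * hist.max(), distance=5)
    return len(peaks)

# Toy example: three sparse sources, two sensors; typically prints 3.
rng = np.random.default_rng(1)
s = rng.laplace(size=(3, 20000)) * (rng.random((3, 20000)) < 0.1)
A = np.array([[1.0, 1.0, 1.0], [0.2, 1.0, 3.0]])  # 2 x 3 mixing matrix
x = A @ s
print(estimate_num_sources(x[0], x[1]))
```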
We propose a novel speech enhancement technique based on the hypothesized Wiener filter (HWF) methodology. The proposed HWF algorithm enhances the noisy input by first 'hypothesizing' a set of filters and then choosing the most appropriate one for the actual filtering. We show that HWF can intrinsically outperform conventional Wiener filtering (CWF) algorithms, which typically select a filter based only on the noisy input signal, resulting in a sub-optimal choice. We present results showing the advantages of HWF-based speech enhancement over CWF, particularly with respect to the baseline performances achievable by HWF and with respect to the type of clean frames used, namely codebooks versus a large number of clean frames. HWF-based speech enhancement consistently outperforms CWF in terms of spectral distortion at various input SNR levels.
To mitigate the performance limitations caused by the constant spectral order β in traditional spectral subtraction methods, we previously presented an adaptive β-order generalized spectral subtraction (GSS) in which the spectral order β is updated heuristically. In this paper, we propose a psychoacoustically-motivated adaptive β-order GSS that accounts for the fact that different frequency bands contribute different amounts to speech intelligibility (i.e., the band-importance function). Specifically, the tendency of the spectral order β to change with the local input signal-to-noise ratio (SNR) is approximated quantitatively by a sigmoid function, derived through a data-driven optimization procedure that minimizes the intelligibility-weighted distance between the desired speech spectrum and its estimate. The parameters of the sigmoid function are further optimized with the same data-driven procedure. Experimental results indicate that the proposed method yields substantial improvements over traditional spectral subtraction methods on intelligibility-weighted measures.
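The core update is easy to sketch. The fragment below applies β-order spectral subtraction with β given by a sigmoid of the local SNR; the band-importance weighting and the data-driven optimization of the sigmoid parameters are omitted, and all parameter values are assumptions.

```python
import numpy as np

def beta_order_gss(noisy_mag, noise_mag, snr_db,
                   alpha=2.0, beta_min=0.5, beta_max=2.0, a=0.5, b=0.0):
    """One frame of adaptive beta-order generalized spectral subtraction.

    noisy_mag, noise_mag : magnitude spectra of the noisy frame and of the
                           noise estimate (same shape).
    snr_db               : local SNR per frequency bin, in dB.
    beta follows a sigmoid of the local SNR; slope `a`, midpoint `b`, and
    the beta range stand in for the paper's data-optimized parameters.
    """
    beta = beta_min + (beta_max - beta_min) / (1.0 + np.exp(-a * (snr_db - b)))
    diff = noisy_mag ** beta - alpha * noise_mag ** beta
    floor = (0.01 * noise_mag) ** beta            # spectral floor
    return np.maximum(diff, floor) ** (1.0 / beta)

# Toy usage on one random magnitude spectrum.
mag = np.abs(np.random.randn(257)) + 1.0
noise = np.full(257, 0.5)
snr = 20.0 * np.log10(mag / noise)
clean_est = beta_order_gss(mag, noise, snr)
```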
We formulate a two-stage iterative Wiener filtering (IWF) approach to speech enhancement that improves on the performance of the constrained IWF reported in the literature. Codebook-constrained IWF (CCIWF) has been shown to be effective in achieving convergence of IWF in the presence of both stationary and non-stationary noise. To this first stage we append a second stage of unconstrained IWF and show that the enhancement performance can be improved in terms of average segmental SNR (SSNR), Itakura-Saito (IS) distance, and linear prediction coefficient (LPC) parameter coincidence. We also explore the tradeoff between the number of CCIWF iterations and the number of second-stage IWF iterations.
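For reference, a bare-bones unconstrained IWF pass (the second stage) can be sketched as follows; the codebook constraint of the first stage is omitted, and the AR order, iteration count, and framing are simplified placeholders.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def iwf_frame(noisy, noise_psd, order=10, iters=3, nfft=512):
    """Unconstrained iterative Wiener filtering of one speech frame.

    Each pass fits an AR model to the current clean-speech estimate,
    builds a Wiener gain from its spectrum, and re-filters the frame.
    noise_psd : assumed-known noise PSD over the nfft//2 + 1 rfft bins.
    """
    n = len(noisy)
    est = noisy.copy()
    Y = np.fft.rfft(noisy, nfft)
    for _ in range(iters):
        # Yule-Walker AR fit to the current estimate.
        r = np.correlate(est, est, 'full')[n - 1:][:order + 1]
        a = solve_toeplitz(r[:order], r[1:order + 1])
        g2 = r[0] - a @ r[1:order + 1]             # prediction-error energy
        A = np.fft.rfft(np.concatenate(([1.0], -a)), nfft)
        speech_psd = g2 / (np.abs(A) ** 2 * n)     # AR spectrum estimate
        H = speech_psd / (speech_psd + noise_psd)  # Wiener gain
        est = np.fft.irfft(H * Y, nfft)[:n]
    return est

# Toy usage: a stable AR(2) signal in white noise (variance 0.25).
rng = np.random.default_rng(0)
clean = lfilter([1.0], [1.0, -0.9, 0.5], rng.standard_normal(400))
out = iwf_frame(clean + 0.5 * rng.standard_normal(400),
                noise_psd=0.25 * np.ones(257))
```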
Speech enhancement is widely used to improve the perceptual quality of noisy speech by suppressing interfering ambient noise, and it is commonly evaluated via objective quality measures. Automatic speech recognition (ASR) systems also use such speech enhancement technologies in the front end to improve their noise robustness. If the objective measures correlate highly with speech recognition accuracy, the ASR performance can be predicted in advance from the objective quality measures, and the enhancement algorithms can be optimized flexibly at the system design stage. Motivated by this idea, this paper investigates the correlation between ASR performance and several traditional objective measures on the Aurora2 database. In the experimental results, the highest correlation coefficient, 0.962, is obtained with the weighted spectral slope (WSS) measure.
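The correlation computation itself is straightforward to reproduce; a sketch with made-up per-condition numbers (the paper's actual scores are not listed in the abstract):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-condition values: objective measure vs. word accuracy.
wss_scores = np.array([35.0, 48.0, 61.0, 80.0, 95.0])  # lower = better
word_acc   = np.array([92.1, 85.4, 74.0, 58.2, 41.5])  # percent correct

r, p = pearsonr(wss_scores, word_acc)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")  # strong (negative) correlation
```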
In modulation-filtering-based speech enhancement, noise suppression is achieved by bandpass filtering the temporal trajectories of the power spectrum. In the literature, some authors use the power spectrum directly for modulation filtering, while others apply different compression functions to reduce the dynamic range of the power spectrum prior to modulation filtering. This paper systematically compares different dynamic range compression functions applied to the power spectrum for speech enhancement. Subjective listening tests and objective measures are used to evaluate the quality as well as the intelligibility of the enhanced speech. Quality is measured objectively in terms of the Perceptual Evaluation of Speech Quality (PESQ) measure and intelligibility in terms of the Speech Transmission Index (STI) measure. It is found that P^0.3333 (the power spectrum raised to the power 1/3) results in the highest speech quality and intelligibility.
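A minimal sketch of one frequency bin's trajectory, using the cube-root compression that the paper finds best; the modulation passband, filter order, and DC restoration below are our assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def modulation_filter(power_traj, frame_rate=100.0, band=(1.0, 16.0),
                      compress=1.0 / 3.0):
    """Bandpass-filter the compressed trajectory of one frequency bin.

    power_traj : power-spectrum values of one bin across frames.
    band       : assumed modulation passband in Hz (speech-dominant range).
    compress   : exponent applied before filtering; 1/3 is the best
                 performer according to the paper.
    """
    c = power_traj ** compress                    # dynamic range compression
    b, a = butter(2, [f / (frame_rate / 2.0) for f in band], btype='band')
    filtered = filtfilt(b, a, c) + np.mean(c)     # restore the removed DC
    return np.maximum(filtered, 0.0) ** (1.0 / compress)  # expand to power

traj = np.abs(np.random.randn(500)) ** 2 + 1e-3   # toy trajectory, 5 s at 100 Hz
enhanced = modulation_filter(traj)
```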
In this paper, we investigate a long-state-vector Kalman filter for the enhancement of speech corrupted by white and coloured noise. Previous studies have reported that a vector Kalman filter achieves better enhancement than the scalar Kalman filter, and it is expected that increasing the state vector length may improve the enhancement performance even further. However, any improvement from a longer state vector is constrained by the typical use of short, non-overlapped speech frames, as the autocorrelation coefficient estimates tend to become less reliable at higher lags. We propose to overcome this problem by incorporating an analysis-modification-synthesis framework in which long, overlapped frames are used instead. Our enhancement experiments on the NOIZEUS corpus show that the proposed long-state-vector Kalman filter achieves higher mean SNR and PESQ scores than the scalar and short-state-vector Kalman filters, supporting the notion that a longer state vector can lead to better enhancement.
Traditional subspace-based speech enhancement (SSE) methods use linear minimum mean square error (LMMSE) estimation, which is optimal if the Karhunen-Loève transform (KLT) coefficients of speech and noise are Gaussian distributed. In this paper, we investigate the use of a Gaussian mixture (GM) density for modeling the non-Gaussian statistics of the clean-speech KLT coefficients. Under the Gaussian mixture model (GMM), the optimal minimum mean square error (MMSE) estimator is found to be nonlinear, and the traditional LMMSE estimator is shown to be a special case. Experimental results show that the proposed method provides better enhancement performance than traditional subspace-based methods.
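For a zero-mean GMM prior on a KLT coefficient observed in additive Gaussian noise, the MMSE estimate is a posterior-weighted mixture of per-component Wiener gains, which the sketch below implements; the component parameters are placeholders, and the LMMSE special case corresponds to a single component.

```python
import numpy as np

def gmm_mmse(y, weights, vars_speech, var_noise):
    """MMSE estimate of a clean KLT coefficient under a zero-mean GMM prior.

    y           : observed noisy coefficient(s), additive Gaussian noise.
    weights     : mixture weights (sum to one); placeholders here.
    vars_speech : per-component speech variances.
    var_noise   : noise variance along this KLT direction.
    """
    w = np.asarray(weights)[:, None]
    vs = np.asarray(vars_speech)[:, None]
    vy = vs + var_noise                 # evidence variance per component
    lik = w * np.exp(-0.5 * y ** 2 / vy) / np.sqrt(2.0 * np.pi * vy)
    post = lik / lik.sum(axis=0)        # component responsibilities
    # Posterior-weighted Wiener gains; nonlinear in y unless K = 1 (LMMSE).
    return (post * (vs / vy) * y).sum(axis=0)

y = np.linspace(-5.0, 5.0, 11)
xhat = gmm_mmse(y, weights=[0.7, 0.3], vars_speech=[0.2, 4.0], var_noise=1.0)
```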
An improved version of the original parametric formulation of the generalized spectral subtraction method is presented in this study. The original formulation uses parameters that minimize the mean-square error (MSE) between the estimated and true speech spectral amplitudes. However, the MSE does not take any perceptual measure into account. We propose two new short-time spectral amplitude estimators based on a perceptual error criterion: the weighted Euclidean distortion. The error function is easily adapted to penalize spectral peaks and valleys differently. Performance was evaluated using two noise types over four SNR levels and compared to the original parametric formulation. Results demonstrate that in most cases the proposed estimators achieve greater noise suppression without introducing speech distortion.
This paper describes a method for reducing sudden noise using noise detection and classification methods together with noise power estimation. Sudden noise detection and classification were dealt with in our previous study. In this paper, the noise classification is improved to handle more kinds of noise based on k-means clustering, and GMM-based noise reduction is performed using the detection and classification results. Classification determines the kind of noise we are dealing with, but its power remains unknown; this problem is solved by combining an estimate of the noise power with the noise reduction method. In our experiments, the proposed method achieved good performance in recognizing utterances overlapped by sudden noises.
Speech enhancement methods using spectral subtraction have the drawback of generating an annoying residual noise with a musical character. In this paper, a frequency-domain optimal linear estimator with perceptual postfiltering is proposed, which incorporates the masking properties of the human auditory system to render the residual noise distortion inaudible. The performance of the proposed enhancement algorithm is evaluated with the segmental SNR, log spectral distance (LSD), and Perceptual Evaluation of Speech Quality (PESQ) measures in various noisy environments, and it yields better results than the Wiener denoising technique.
We present a technique for denoising speech using temporally regularized nonnegative matrix factorization (NMF). In previous work [1], we used a regularized NMF update to impose structure within each audio frame. In this paper, we add frame-to-frame regularization across time and show that this additional regularization can also improve our speech denoising results. We evaluate our algorithm on a range of nonstationary noise types and outperform a state-of-the-art Wiener filter implementation.
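As an illustration (under a squared-error objective; the paper's exact cost and update rules may differ), multiplicative NMF updates with a frame-to-frame smoothness penalty on the activations can be written as:

```python
import numpy as np

def nmf_temporal(V, rank=8, lam=0.1, iters=200, eps=1e-9):
    """NMF of a magnitude spectrogram V (F x T) with temporal regularization.

    Minimizes ||V - WH||_F^2 + lam * sum_t ||H[:, t] - H[:, t-1]||^2 with
    heuristic multiplicative updates (penalty gradient split into its
    positive and negative parts to preserve nonnegativity).
    """
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(iters):
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        nbr = np.zeros_like(H)          # neighbor sums H[:, t-1] + H[:, t+1]
        nbr[:, 1:] += H[:, :-1]
        nbr[:, :-1] += H[:, 1:]
        H *= (W.T @ V + lam * nbr) / (W.T @ W @ H + 2.0 * lam * H + eps)
    return W, H

V = np.abs(np.random.randn(129, 60))    # stand-in magnitude spectrogram
W, H = nmf_temporal(V)
```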
This paper proposes a novel ICA-based MAP speech enhancement algorithm using multiple variable speech distribution models. The proposed algorithm consists of two stages: primary and advanced enhancement. The primary enhancement employs a single distribution model obtained from all speech signals. The advanced enhancement first employs multiple models of speech signals, each modeling a specific type of speech, and then adapts these model parameters for each speech frame using the enhanced signal from the primary stage. A statistical noise adaptation technique is employed to better model the noise in the non-stationary case. The proposed algorithm has been evaluated on speech from the TIMIT database corrupted by various noises and shows significantly improved performance over the single speech distribution model.
We describe a system for separating multiple sources from a two-channel recording based on interaural cues and known characteristics of the source signals. We combine a probabilistic model of the observed interaural level and phase differences with a prior model of the source statistics and derive an EM algorithm for finding the maximum likelihood parameters of the joint model. The system is able to separate more sound sources than there are observed channels. In simulated reverberant mixtures of three speakers the proposed algorithm gives a signal-to-noise ratio improvement of 2.1 dB over a baseline algorithm using only interaural cues.
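The interaural cues themselves are simple to extract from a two-channel STFT; a sketch (the window length and the toy delay are arbitrary):

```python
import numpy as np
from scipy.signal import stft

def interaural_cues(left, right, fs=16000, nperseg=1024):
    """Interaural level (dB) and phase (rad) differences per T-F bin."""
    _, _, L = stft(left, fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    eps = 1e-12
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    ipd = np.angle(L * np.conj(R))
    return ild, ipd

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
ild, ipd = interaural_cues(x, np.roll(x, 3))  # toy 3-sample interchannel delay
```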
This paper presents an adaptive beamforming application based on far-field speech captured from a single speaker in a real meeting room. After the position of the speaker is estimated by a speaker tracking system, we construct a subband-domain beamformer in generalized sidelobe canceller (GSC) configuration. In contrast to conventional practice, we then optimize the active weight vectors of the GSC so that the distribution of the output signal is as non-Gaussian as possible, using kurtosis to measure the degree of non-Gaussianity. Our beamforming algorithms can suppress noise and reverberation without the signal cancellation problems encountered in conventional beamforming algorithms. We demonstrate the effectiveness of the proposed techniques through a series of far-field automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV). The beamforming algorithm proposed here achieves a 13.6% WER, whereas a simple delay-and-sum beamformer gives a WER of 17.8%.
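The non-Gaussianity criterion is simple to state: clean speech is super-Gaussian, so maximizing the excess kurtosis of the beamformer output steers the active weights away from noise and reverberation. A sketch of the objective (the weight optimization itself is omitted):

```python
import numpy as np

def excess_kurtosis(y):
    """Excess kurtosis of a signal: zero for Gaussian noise, large and
    positive for the 'peaky' amplitude distribution of clean speech."""
    y = y - np.mean(y)
    m2 = np.mean(y ** 2)
    return np.mean(y ** 4) / m2 ** 2 - 3.0

rng = np.random.default_rng(0)
print(excess_kurtosis(rng.standard_normal(100000)))  # ~ 0 (Gaussian)
print(excess_kurtosis(rng.laplace(size=100000)))     # ~ 3 (super-Gaussian)
```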
A noise-robust dereverberation method is presented for speech enhancement in noisy reverberant conditions. This method introduces the constraint of minimizing the noise power into the inverse filter computation for dereverberation. It is shown that there exists a tradeoff between reducing the reverberation and reducing the noise; this tradeoff can be controlled by the constraint. Inverse filtering reduces early reflections and directional noise. In addition, spectral subtraction is used to suppress the tail of the inverse-filtered reverberation and the residual noise. The performance of our method is evaluated objectively and subjectively in experiments using measured room impulse responses. The results indicate that this method provides better speech quality than conventional methods.
The performance of two-microphone coherence-based methods degrades if the noises captured at the two microphones are correlated. Cross power spectrum subtraction (CPSS) is an adaptation of the coherence method for correlated-noise environments. In this paper, we propose a new technique for estimating the speech cross-power spectral density and exploit it in CPSS. The proposed speech enhancement method is evaluated both as a preprocessing system for speech recognition and as a standalone speech enhancement system. The enhancement results show the practical superiority of the proposed method compared with previous solutions.
The use of cellular phones and small-form-factor devices such as PDAs and other handhelds has been increasing rapidly. Their uses are varied, with scenarios such as communication, internet browsing, and audio and video recording, to name a few. This calls for a better sound capture system, since the sound source is at a larger distance from the device's microphone. In this paper we propose a sound capture system for small devices that uses two unidirectional microphones placed back-to-back close to each other. The processing consists of a beamformer and a non-linear spatial filter. The speech enhancement processing achieves an improvement of 0.39 MOS points in perceptual sound quality and a 10.8 dB improvement in SNR.