
A speech signal captured by a distant microphone is generally contaminated by reverberation, which severely degrades the quality and intelligibility of the observed speech. In this paper, we investigate single-channel dereverberation, which is considered one of the most challenging tasks. We propose an example-based speech enhancement approach used in combination with a non-example-based (conventional) blind dereverberation algorithm, so that the two complement each other. The term example-based refers to a method that has exact (rather than merely statistical) information about the clean speech as its model. The combination of the two algorithms is formulated using the uncertainty decoding technique, achieving a smooth and theoretically grounded interconnection. Experimental results show that the proposed method achieves better dereverberation in severely reverberant environments than the conventional methods in terms of objective quality measures.

Single-channel late reverberation suppression algorithms need estimates of the late reverberant spectral variance (LRSV) in order to suppress the late reverberation. The LRSV estimators are often derived from a statistical room impulse response (RIR) model, in which the late reverberation is usually modeled as a white Gaussian noise sequence with exponentially decaying variance. The whiteness assumption implies that the same decay constant is assumed for all frequencies. Since absorption of sound energy generally increases with frequency, RIR models are needed that take this into account. We propose a new statistical time-varying RIR model that consists of a sum of decaying cosine functions with random phases and a frequency-dependent decay constant. We show that the resulting LRSV estimators have the same form as existing ones, but with an inherent frequency dependency of the decay constant. Experiments with real measured RIRs, however, indicate that for the purpose of reverberation suppression, a frequency-independent decay constant is often sufficient. A common assumption in the derivation of LRSV estimators is that the direct signal and early reflections are uncorrelated with the late reverberation. We verify this assumption experimentally on measured RIRs and conclude that it is accurate.
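The exponential-decay model underlying such estimators can be illustrated with a short sketch. This is not the paper's estimator; it is a minimal numpy implementation of the common LRSV form in which the late reverberant power is a delayed, exponentially decayed copy of the observed power, with a per-frequency decay constant derived from the reverberation time (all parameter names are illustrative):

```python
import numpy as np

def lrsv_estimate(power_spec, t60_per_bin, hop_s, delay_frames):
    """Estimate late reverberant spectral variance (LRSV) per time-frequency bin.

    power_spec:    (n_bins, n_frames) observed short-time power spectrum
    t60_per_bin:   (n_bins,) reverberation time per frequency bin (seconds)
    hop_s:         STFT hop size in seconds
    delay_frames:  frames (>= 1) separating direct sound from late reverberation
    """
    delta = 3.0 * np.log(10.0) / t60_per_bin             # decay constant per bin
    decay = np.exp(-2.0 * delta * hop_s * delay_frames)  # energy decay over the delay
    lrsv = np.zeros_like(power_spec)
    # late reverb power is a decayed, delayed copy of the observed power
    lrsv[:, delay_frames:] = decay[:, None] * power_spec[:, :-delay_frames]
    return lrsv
```

Using the same decay constant for every bin recovers the frequency-independent case the experiments found often sufficient.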

The effect of ideal time-frequency masking (ITFM) on the intelligibility of reverberated speech is tested using objective measurements, namely STI and PESQ scores. The best choice of ITFM threshold is determined for a range of reverberation times (RTs). Four existing dereverberation algorithms are also assessed. Objective test results and informal subjective listening show that ITFM provides a substantial intelligibility improvement for all RTs and outperforms the existing dereverberation algorithms, one of which assumes perfect knowledge of the room impulse response. While ITFM provides only a best-possible performance bound, our results demonstrate the potential improvement that could be obtained using time-frequency masking for speech dereverberation.
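An ideal binary time-frequency mask of this kind can be sketched in a few lines, assuming oracle access to the direct-path and reverberant power in each time-frequency unit (the threshold convention is an assumption; the study determines the best threshold empirically):

```python
import numpy as np

def ideal_tf_mask(direct_power, reverb_power, threshold_db=0.0):
    """Ideal binary time-frequency mask: keep units whose direct-to-reverberant
    energy ratio exceeds a threshold (in dB); zero the rest."""
    eps = 1e-12
    drr_db = 10.0 * np.log10((direct_power + eps) / (reverb_power + eps))
    return (drr_db > threshold_db).astype(float)
```

The mask is applied elementwise to the reverberant STFT before resynthesis.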

This paper presents three effective proposals for a two-stage algorithm for one-microphone reverberant speech enhancement. The original algorithm is divided into two blocks: one that deals with the coloration effect due to the early reflections, and another that reduces the long-term reverberation. The proposed modifications concern the linear-prediction model order, the adaptation step size, and the stop criterion of the first algorithm stage. All the modifications are evaluated by a perceptual-quality measure specific to the speech-reverberation context. Experimental results for a 200-signal database show that the proposed improvements yield an increase of 12% in the perceptual measure and a reduction of about 96% in computation cost compared to the original framework.

In this work, we present a model-based Wiener filter whose frequency response is optimized in the dimensionally reduced log-Mel domain. This is achieved by making use of a relatively recent speech feature enhancement approach originally developed in the area of speech recognition. Its combination with Wiener filtering is motivated by the fact that signal reconstruction from log-Mel features sounds very unnatural. Hence, we correct only the spectral envelope and preserve the fine spectral structure of the noisy signal. Experiments on a Wall Street Journal corpus showed relative improvements of up to 24% in PESQ and 45% in log spectral distance (LSD), compared to Ephraim and Malah's log-spectral amplitude estimator.

Binaural hearing aids include a wireless link to exchange the signals received at each side, allowing the implementation of more efficient noise-reduction algorithms for hostile environments such as babble noise. Although several binaural noise-reduction techniques have been proposed in the literature, only a few of them preserve the localization cues of the target and interfering signals simultaneously without degrading the SNR improvement. This paper proposes a novel binaural noise-reduction method based on blind source separation (BSS) and a perceptual post-processing technique. Objective and subjective tests under four different scenarios were performed. The method showed good output sound quality, high SNR improvement at very low input SNR conditions, and preservation of localization cues for both signal and noise, outperforming both an existing BSS-based method and a multichannel Wiener filter (MWF).

In this paper, we propose a structure-generalized parametric blind spatial subtraction array (BSSA) and conduct a theoretical analysis of the amounts of musical noise and speech distortion via higher-order statistics. We theoretically prove a tradeoff between the amounts of musical noise and speech distortion in the various BSSA structures. From the analysis and experimental evaluations, we reveal that the structure should be carefully selected according to the application: a channel-wise BSSA structure is recommended for listening, whereas the normal BSSA is more suitable for speech recognition.

Speakers appear to adopt strategies to improve speech intelligibility for interlocutors in adverse acoustic conditions. Generated speech, whether synthetic, recorded or live, may also benefit from context-sensitive modifications in challenging situations. The current study measured the effect on intelligibility of six spectral and temporal modifications operating under global constraints of constant input-output energy and duration. Reallocation of energy from mid-frequency regions with high local SNR produced the largest intelligibility benefits, while other approaches such as pause insertion or maintenance of a constant segmental SNR actually led to a deterioration in intelligibility. Listener scores correlated only moderately well with recent objective intelligibility estimators, suggesting that further development of intelligibility models is required to improve predictions for modified speech.

The goal of speech enhancement algorithms is to provide an estimate of clean speech starting from noisy observations. In general, the estimate is obtained by minimizing a chosen distortion metric. The often-employed cost is the mean-square error (MSE), which results in a Wiener-filter solution. Since the ground truth is not available in practice, the practical utility of such optimal estimators is limited. Alternatively, one can optimize an unbiased estimate of the MSE. This is the key idea behind Stein's unbiased risk estimation (SURE) principle. Within this framework, we derive SURE solutions for the MSE and Itakura-Saito (IS) distortion measures. We also propose parametric versions of the corresponding SURE estimators, which give additional flexibility in controlling the attenuation characteristics for maximum signal-to-noise-ratio (SNR) gain. We compare the performance of the two distortion measures in terms of attenuation profiles, average segmental SNR, global SNR, and spectrograms. We also include a comparison with the standard power spectral subtraction technique. The results show that the IS distortion consistently gives a better performance gain in all respects. The perceived quality of the enhanced speech is also better in the case of the IS metric.
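For context, the Wiener-filter solution mentioned above has a simple spectral-gain form; a minimal sketch follows, with an illustrative exponent and gain floor standing in for the kind of parametric flexibility discussed (these knobs are assumptions, not the paper's SURE estimators):

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, alpha=1.0, floor=0.0):
    """Spectral Wiener-type gain. alpha=1 gives the classical Wiener filter;
    alpha and floor are illustrative knobs trading SNR gain vs. distortion."""
    eps = 1e-12
    # a priori SNR via power subtraction, clipped to be nonnegative
    snr_prio = np.maximum(noisy_power - noise_power, 0.0) / (noise_power + eps)
    gain = (snr_prio / (1.0 + snr_prio)) ** alpha
    return np.maximum(gain, floor)
```

The gain is applied per frequency bin to the noisy spectrum; the floor limits musical-noise artifacts at low SNR.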

Various speech enhancement techniques (e.g., noise suppression, dereverberation) rely on knowledge of the statistics of the clean signal and the noise process. In practice, however, these statistics are not explicitly available, and the overall enhancement accuracy critically depends on the estimation quality of the unknown statistics. In this respect, subspace-based approaches have been shown to allow for reduced estimation delay and to achieve a good tradeoff between tracking and final misadjustment [1,2]. For accurate tracking of noise non-stationarity, these schemes must estimate the correlation matrix of the observed signal from a limited number of samples. In this paper, we investigate the effect of covariance estimation artifacts on noise PSD tracking. We show that the estimation drawbacks can be alleviated using an appropriate selection scheme.

This paper examines whether non-acoustic noise reference signals can provide accurate estimates of noise at very low signal-to-noise ratios (SNRs), where conventional estimation methods are less effective. The environment chosen for the investigation is Formula 1 motor racing, where SNRs are as low as -15 dB and the non-acoustic reference signals are engine speed, road speed, and throttle measurements. Noise is found to relate closely to these reference signals, and a maximum a posteriori (MAP) method is proposed to estimate airflow and tyre noise from them. Objective tests show MAP estimation to be more accurate than a range of conventional noise estimation methods. Subjective listening tests then compare speech enhancement using the proposed MAP estimation to conventional methods, with the former found to give significantly higher speech quality.

In this paper, to achieve high-quality speech enhancement, we introduce the generalized minimum mean-square error short-time spectral amplitude estimator with a new blind prior estimation of the speech probability density function (p.d.f.). To deal with various types of speech signals with different p.d.f.s, we propose a speech kurtosis estimation algorithm based on a moment-cumulant transformation for blind adaptation of the shape parameter of the speech p.d.f. Objective and subjective evaluation experiments show the improved noise reduction performance of the proposed method.
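The moment-cumulant relation underlying such kurtosis estimation can be illustrated for the simplest sample-based case (this is the textbook fourth-cumulant formula, not the paper's blind adaptation algorithm):

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis via the moment-cumulant relation
    kappa4 = m4 - 3*m2^2 (central moments), normalized by m2^2."""
    xc = x - x.mean()
    m2 = np.mean(xc ** 2)
    m4 = np.mean(xc ** 4)
    return m4 / m2 ** 2 - 3.0
```

A Gaussian signal gives a value near 0, while more peaked (speech-like) distributions such as the Laplacian give clearly positive values, which is what makes kurtosis usable as a shape-parameter cue.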

This paper reports the results of a comparative study on blind speech separation (BSS) of two types of convolutive mixtures. The separation criterion is based on Frequency Oriented Principal Components Analysis (FOPCA). This method is compared to two other well-known methods: the Degenerate Unmixing Estimation Technique (DUET) and Convolutive Fast Independent Component Analysis (C-FICA). The efficiency of FOPCA is exploited to derive a BSS algorithm for the under-determined case (more speakers than microphones). The FOPCA method is compared objectively, in terms of the signal-to-interference ratio (SIR) and Perceptual Evaluation of Speech Quality (PESQ) criteria, and subjectively, by the Mean Opinion Score (MOS). Conventional algorithms in the frequency domain are usually subject to permutation problems; the proposed algorithm has the attractive feature that this problem does not arise.

Methods for Blind Source Separation (BSS) aim at recovering signals from their mixture without prior knowledge about the signals and the mixing system. Among others, they provide tools for enhancing speech signals when they are disturbed by unknown noise or other interfering signals in the mixture. This paper considers a recent time-domain BSS method that is based on a complete decomposition of a signal subspace into components that should be independent. The components are used to reconstruct images of the original signals using an ad hoc weighting, which markedly influences the final performance of the method. We propose a novel weighting scheme that utilizes the block-Toeplitz structure of the signal matrices and thus relies on an established property. We provide experiments with blind speech separation and speech recognition that demonstrate the better performance of the modified BSS method.

The blind speech separation of convolutive mixtures can be performed in the time-frequency domain. The separation problem then reduces to a set of instantaneous mixing problems, one for each frequency bin, that can be solved independently by any appropriate instantaneous ICA algorithm. However, the arbitrary order of the estimated sources in each frequency bin, known as the permutation problem, has to be resolved to successfully recover the original sources. This paper deals with the permutation problem in the general case of N sources and N observations. The proposed method combines a correlation approach, based on the amplitude correlation property of speech signals, with an optimal pairing scheme to align the permuted solutions. Our method is robust to artificially permuted speech signals. Experimental results on simulated convolutive mixtures show the effectiveness of the proposed method in terms of the quality of the separated signals, by both objective and perceptual measures.
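A minimal sketch of amplitude-correlation permutation alignment in this spirit is given below; the greedy bin-by-bin pairing against a running reference is an assumption for illustration, and the paper's optimal pairing scheme may differ:

```python
import numpy as np
from itertools import permutations

def align_permutations(envelopes):
    """Resolve the frequency-bin permutation ambiguity by amplitude correlation.

    envelopes: (n_bins, n_src, n_frames) magnitude envelopes of the sources
               estimated independently in each frequency bin.
    Returns per-bin permutations (n_bins, n_src) aligning all bins to bin 0.
    """
    n_bins, n_src, _ = envelopes.shape
    perms = np.empty((n_bins, n_src), dtype=int)
    perms[0] = np.arange(n_src)
    ref = envelopes[0].copy()              # running reference envelopes
    for k in range(1, n_bins):
        best, best_score = None, -np.inf
        for p in permutations(range(n_src)):
            # total correlation between reference and permuted envelopes
            score = sum(np.corrcoef(ref[i], envelopes[k, p[i]])[0, 1]
                        for i in range(n_src))
            if score > best_score:
                best, best_score = p, score
        perms[k] = best
        ref += envelopes[k][list(best)]    # accumulate the aligned bin
    return perms
```

Exhaustive search over permutations is feasible here because N (the number of sources) is small; larger N would call for an assignment solver.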

This paper introduces a speaker adaptation algorithm for nonnegative matrix factorization (NMF) models. The proposed adaptation algorithm is a combination of Bayesian and subspace model adaptation. The adapted model is used to separate the speech signal from a background music signal in a single recording. Training speech data from multiple speakers is used with NMF to train a set of basis vectors as a general model for speech signals. The probabilistic interpretation of NMF is used to achieve Bayesian adaptation, which adjusts the general model to the actual properties of the speech signal observed in the mixture. The Bayesian-adapted model is then adapted again by a linear transform, which changes the subspace spanned by the model to better match the speech signal in the mixture. The experimental results show that combining Bayesian adaptation with linear transform adaptation improves the separation results.
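A supervised NMF separation of this general kind can be sketched as follows; this is a generic Euclidean-distance NMF with fixed speech and music bases and a soft-mask reconstruction, not the paper's Bayesian or subspace adaptation:

```python
import numpy as np

def separate_with_nmf(V, W_speech, W_music, n_iter=100):
    """Supervised NMF separation: factor the mixture magnitude spectrogram V
    over fixed speech and music basis vectors, then recover a speech estimate
    via a Wiener-style soft mask. W_*: (n_bins, n_basis)."""
    eps = 1e-9
    W = np.hstack([W_speech, W_music])
    H = np.abs(np.random.RandomState(0).rand(W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        # multiplicative update minimizing the Euclidean distance ||V - WH||
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    ks = W_speech.shape[1]
    V_speech = W_speech @ H[:ks]
    V_total = W @ H + eps
    return V * (V_speech / V_total)        # soft mask applied to the mixture
```

Adaptation methods like those above would modify W_speech itself before or during this factorization.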

In two previous papers, we proposed an audio Informed Source Separation (ISS) system which can achieve the separation of I > 2 musical sources from linear instantaneous stationary stereo (2-channel) mixtures, based on the audio signal's natural sparsity, pre-mix source signal analysis, and side-information embedding (within the mix signal). In the present paper and for the first time, we apply this system to mixtures of (up to seven) simultaneous speech signals. Compared to the reference MPEG-4 Spatial Audio Object Coding system, our system provides much cleaner separated speech signals (consistently 10-20 dB higher Signal-to-Interference Ratios), revealing strong potential for audio conferencing applications.

This paper tackles the speech separation problem in a meeting room using a new acoustic beamforming method, the adaptive blocking (AB) beamformer. The proposed method is an optimum beamformer with a structure similar to, but simpler than, the generalized sidelobe canceller (GSC). Thus, it inherits the flexibility of the GSC and functions well in dynamic environments. We investigate the performance of the proposed method through different experiments and compare the results with a minimum variance distortionless response (MVDR) beamformer in the GSC structure. The experimental setups include one wanted speaker, two interferers, air-conditioner noise, and uncorrelated sensor noise. AB provides improvement over MVDR-GSC.

This paper describes a novel approach to flexible control of speaker characteristics using a tensor representation of the speaker space. In voice conversion studies, realizing conversion from/to an arbitrary speaker's voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC) based on an eigenvoice Gaussian mixture model (EV-GMM) was proposed. In EVC, similarly to speaker recognition approaches, a speaker space is constructed based on GMM supervectors, which are high-dimensional vectors derived by concatenating the mean vectors of each speaker GMM. In this space, each speaker is represented by a small number of weight parameters of the eigen-supervectors. In this paper, we revisit the construction of the speaker space by introducing tensor analysis of the training data set. In our approach, each speaker is represented as a matrix whose rows and columns correspond to the Gaussian components and the dimensions of the mean vectors, respectively, and the speaker space is derived by tensor analysis of the set of such matrices. Our approach solves an inherent problem of the supervector representation and improves the performance of voice conversion. Experimental results on one-to-many voice conversion demonstrate the effectiveness of the proposed approach.

GMM-based mapping techniques have proved to be an efficient method for finding a nonlinear regression function between two spaces and have found success in voice conversion. In these methods, a linear transformation is estimated for each Gaussian component, and the final conversion function is a weighted sum of all the linear transformations. These linear transformations fit well for samples near the center of at least one Gaussian component, but may not deal well with samples far from the centers of all Gaussian distributions. To overcome this problem, this paper proposes the Bag of Gaussian Model (BGM). The BGM consists of two types of Gaussian distributions, namely basic and complex distributions. Compared with the classical GMM, the BGM is adaptive to samples: for a given sample, the BGM can select the set of Gaussian distributions that fits it best. We develop a data-driven method to construct the BGM and show how to estimate the regression function with it. We carry out experiments on voice conversion tasks, and the results demonstrate the usefulness of the BGM-based methods.
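The conversion function described above, a posterior-weighted sum of per-component linear transforms, can be sketched directly (diagonal covariances and all variable names are assumptions for brevity):

```python
import numpy as np

def gmm_convert(x, weights, means, variances, A, b):
    """GMM-based mapping: posterior-weighted sum of per-component linear
    transforms, y = sum_i p(i|x) (A_i x + b_i). Diagonal covariances assumed.

    x: (d,) source frame; weights: (m,); means, variances: (m, d);
    A: list of (d, d) matrices; b: list of (d,) offsets.
    """
    # component log-likelihoods of x under each diagonal Gaussian
    log_like = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances)
                               + (x - means) ** 2 / variances, axis=1))
    post = np.exp(log_like - log_like.max())   # stable softmax -> posteriors
    post /= post.sum()
    # weighted sum of the per-component linear transforms
    return sum(p * (Ai @ x + bi) for p, Ai, bi in zip(post, A, b))
```

The weakness the BGM targets is visible here: when x is far from every mean, the posteriors are nearly uniform and the output averages poorly matched transforms.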

A spectral conversion method using multiple Gaussian mixture models (GMMs) based on the Bayesian framework is proposed. A typical spectral conversion framework is based on a single GMM. In this conventional method, however, the appropriate number of mixture components depends on the amount of training data and thus has to be determined beforehand. In the proposed method, the variational Bayesian approach is applied to GMM-based voice conversion, and multiple GMMs are integrated into a single statistical model. Appropriate model structures are stochastically selected for each frame based on the Bayesian framework.

Common voice conversion systems employ a spectral/time-domain mapping to convert speech from one speaker to another. The converted speech does not sound natural because the spectral/time-domain patterns of the two speakers' speech do not match completely. In this paper we propose a method that uses inter-frame (dynamic) characteristics in addition to intra-frame characteristics to find the converted speech frames. The method is based on VQ and uses a trellis structure to find the best conversion function. The proposed method provides high-quality converted voice, low computational complexity, and a small trained model size compared to other common methods. Subjective and objective evaluations demonstrate the superiority of the proposed method over the VQ-based and GMM-based methods.

The goal of voice conversion is to transform a sentence said by one speaker so that it sounds as if another speaker had said it. The classical conversion based on a Gaussian mixture model, as well as several other schemes suggested since, produces muffled-sounding output due to excessive smoothing of the spectral envelopes. To reduce the muffling effect, enhancement of the Global Variance (GV) of the spectral features was recently suggested. We propose a different approach to GV enhancement, based on the classical conversion formalized as a GV-constrained minimization. Listening tests show that the proposed approach achieves an improvement in quality.

Dynamic Frequency Warping (DFW) offers an appealing alternative to GMM-based voice conversion, which suffers from the "over-smoothing" that hinders speech quality. However, to adjust the spectral power after DFW, previous work falls back on the GMM transformation. This paper proposes a more effective DFW with amplitude scaling (DFWA) that operates at the acoustic-class level and is independent of the GMM transformation. The amplitude scaling compares the average target and warped-source log-amplitude spectra for each class. DFWA outperforms the GMM in terms of both speech quality and timbre conversion, as confirmed in objective and subjective testing. Moreover, DFWA performs equally well using parallel or nonparallel corpora.

This paper describes an artificial bandwidth extension (ABE) method that generates new high-frequency components for a narrowband signal by folding gain-adjusted subbands to frequencies from 4 kHz to 7 kHz, improving the quality and intelligibility of narrowband speech in mobile devices. The proposed algorithm was evaluated by subjective listening tests. In addition, a rarely used conversation test was constructed. The speech quality of 1) a narrowband phone call, 2) a wideband phone call, and 3) a narrowband phone call enhanced with ABE was evaluated in a conversational context using mobile devices with integrated hands-free (IHF) functionality. The results indicate that in the IHF use case, ABE quality exceeds narrowband speech quality both in car noise and in a quiet environment.
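Spectral folding of this general kind can be sketched in the FFT domain; here a single fixed attenuation stands in for the method's per-subband gains, and all parameters (band edges, frame length, gain) are illustrative:

```python
import numpy as np

def extend_bandwidth(frame, fs=16000, gain_db=-12.0):
    """Crude artificial bandwidth extension by spectral folding: mirror the
    1-4 kHz band of a frame into 4-7 kHz with a fixed attenuation.
    (Real systems adapt the subband gains; a constant is used for brevity.)"""
    X = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    src = (freqs >= 1000) & (freqs < 4000)
    dst = (freqs >= 4000) & (freqs < 7000)
    g = 10.0 ** (gain_db / 20.0)
    n = min(src.sum(), dst.sum())
    # mirror around 4 kHz: content just below 4 kHz lands just above it
    X[np.where(dst)[0][:n]] = g * X[np.where(src)[0][::-1][:n]]
    return np.fft.irfft(X, len(frame))
```

A 2 kHz tone, for example, reappears (attenuated) near 6 kHz after folding; the input is assumed to be resampled to 16 kHz so the extended band exists in the spectrum.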