We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time-frequency (T-F) mask, which retains the mixture in a local T-F unit if and only if the target is stronger than the interference within the unit. In the first stage, we use harmonicity to segregate the voiced portions of individual sources in each time frame based on multipitch tracking. Additionally, unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T-F units across time frames. The resulting T-F masks are used in conjunction with missing-data methods for recognition. Systematic evaluations on a speech separation challenge task show significant improvement over the baseline performance.
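As a rough illustration of the ideal binary mask criterion described above (a minimal sketch, not the authors' implementation), the following assumes per-unit target and interference energies are already available from a T-F decomposition; the array names are hypothetical:

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy):
    """Keep a T-F unit iff the local target energy exceeds the interference energy.

    Both inputs are (channels, frames) arrays of local energies computed from the
    premixed sources (only available when constructing the *ideal* mask).
    """
    return (target_energy > interference_energy).astype(float)

def apply_mask(mixture_tf, mask):
    """Retain the mixture in units where the mask is 1 and discard the rest."""
    return mixture_tf * mask
```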
This paper introduces a speech separation system as a front-end processing step for automatic speech recognition (ASR). It employs computational auditory scene analysis (CASA) to separate the target speech from the interfering speech. Specifically, the mixed speech is first preprocessed with an auditory peripheral model. Pitch tracking is then performed, and the dominant pitch is used as the main cue for finding the target speech. Next, the time-frequency (T-F) units are merged into segments. These segments are then combined into streams via initial CASA grouping. A regrouping strategy refines these streams using amplitude modulation (AM) cues, and the refined streams are finally organized into the corresponding speakers using speaker recognition techniques. Finally, the output streams are reconstructed to compensate for the data missing after the preceding processing steps, using cluster-based feature reconstruction. ASR results show that at low TMR (< -6 dB) the proposed method offers significantly higher recognition accuracy.
For pitch tracking of a single speaker, a common requirement is to find the optimal path through a set of voiced or voiceless pitch estimates over a sequence of time frames. Dynamic programming (DP) algorithms have previously been applied to this problem. Here, the pitch candidates are provided by a multi-channel autocorrelation-based estimator, and DP is extended to pitch tracking of multiple concurrent speakers. We use the resulting pitch information to enhance harmonic content in noisy speech and to separate target from interfering speech.
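A minimal DP sketch of the single-speaker case, assuming each frame already has autocorrelation-derived pitch candidates and scores (the multi-speaker extension tracks candidate combinations; the transition cost here is an assumed octave-jump penalty, not necessarily the paper's exact cost):

```python
import numpy as np

def dp_pitch_track(candidates, scores, jump_penalty=2.0):
    """Select one pitch candidate per frame by dynamic programming.

    candidates[t]: array of candidate F0 values (Hz) for frame t.
    scores[t]:     matching array of (higher-is-better) autocorrelation scores.
    jump_penalty:  cost per octave of pitch change between consecutive frames.
    Returns the selected F0 value for each frame.
    """
    T = len(candidates)
    best = np.asarray(scores[0], dtype=float)
    back = []
    for t in range(1, T):
        prev_f0 = np.asarray(candidates[t - 1], dtype=float)
        cur_f0 = np.asarray(candidates[t], dtype=float)
        # transition cost grows with the pitch jump (in octaves) between frames
        trans = jump_penalty * np.abs(np.log2(cur_f0[:, None] / prev_f0[None, :]))
        total = best[None, :] - trans            # (current, previous)
        back.append(np.argmax(total, axis=1))
        best = np.asarray(scores[t], dtype=float) + np.max(total, axis=1)
    # backtrack the best path
    idx = int(np.argmax(best))
    path = [idx]
    for bp in reversed(back):
        idx = int(bp[idx])
        path.append(idx)
    path.reverse()
    return [float(candidates[t][path[t]]) for t in range(T)]
```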
This paper addresses the problem of recognising speech in the presence of a competing speaker. We employ a speech fragment decoding technique that treats segregation and recognition as coupled problems. Data-driven techniques are used to segment a spectro-temporal representation into a set of spectro-temporal fragments, such that each fragment is dominated by one or the other of the speech sources. A speech fragment decoder is used which employs missing data techniques and clean speech models to simultaneously search for the set of fragments and the word sequence that best matches the target speaker model. The paper reports recent advances in this technique, and presents an evaluation based on artificially mixed speech utterances. The fragment decoder produces significantly lower error rates than a conventional recogniser, and mimics the pattern of human performance whereby performance increases as the target-masker ratio is reduced below -3 dB.
This paper proposes an algorithm for the recognition and separation of speech signals in non-stationary noise, such as another speaker. We present a method to combine hidden Markov models (HMMs) trained for the speech and noise into a factorial HMM to model the mixture signal. Robustness is obtained by separating the speech and noise signals in a feature domain, which discards unnecessary information. We use mel-frequency cepstral coefficients (MFCCs) as features, and estimate the distribution of mixture MFCCs from the distributions of the target speech and noise. A decoding algorithm is proposed for finding the state transition paths and estimating gains for the speech and noise from a mixture signal. Simulations were carried out using speech material where two speakers were mixed at various levels, and even at a high noise level (9 dB above the speech level), the method produced relatively good results (60% word recognition accuracy). Audio demonstrations are available at www.cs.tut.fi/~tuomasv.
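The sketch below illustrates one common way such a factorial combination can be scored, using a log-max approximation of the mixture in the log-mel domain; the paper's own MFCC-domain derivation and gain estimation may differ, and all names here are assumptions:

```python
import numpy as np

def factorial_frame_scores(obs_logmel, speech_means, noise_means, var=1.0):
    """Log-likelihood of one mixture frame for every (speech state, noise state) pair.

    obs_logmel:   (D,)   observed mixture log-mel frame (assumed feature domain).
    speech_means: (S, D) Gaussian means of the speech HMM states.
    noise_means:  (N, D) Gaussian means of the noise HMM states.
    The mixture mean is approximated by the element-wise max of the source means
    (the 'log-max' approximation); a shared diagonal variance `var` is assumed.
    Returns an (S, N) score matrix for a factorial-HMM decoder.
    """
    mix_means = np.maximum(speech_means[:, None, :], noise_means[None, :, :])  # (S, N, D)
    diff = obs_logmel[None, None, :] - mix_means
    return -0.5 * np.sum(diff ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)
```

A factorial decoder would then run a Viterbi search over the S x N composite state space, jointly choosing the speech and noise state paths.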
This paper considers the recognition of speech given in the form of two mixed sentences, spoken by the same talker or by two different talkers. The database published on the ICSLP2006 website for the Two-Talker Speech Separation Challenge is used in the study. A system that recognizes and reconstructs both sentences from the given mixture is described. The system involves a combination of several different techniques, including a missing-feature approach for improving crosstalk/noise robustness, Wiener filtering for speech restoration, HMM-based speech reconstruction, and speaker-dependent/-independent modeling for speaker/speech recognition. For clean speech recognition, the system obtained a word accuracy rate of 96.7%. For the two-talker speech separation challenge task, the system obtained 81.4% at 6 dB TMR (target-to-masker ratio) and 34.1% at -9 dB TMR.
We describe a system for model-based speech separation which achieves super-human recognition performance when two talkers speak at similar levels. The system can separate the speech of two speakers from a single-channel recording with remarkable results. It incorporates a novel method for performing two-talker speaker identification and gain estimation. We extend the method of model-based high-resolution signal reconstruction to incorporate temporal dynamics. We report on two methods for introducing dynamics; the first uses dynamics in the acoustic model space, the second incorporates dynamics based on sentence grammar. The addition of temporal constraints leads to dramatic improvements in the separation performance. Once the signals have been separated, they are recognized using speaker-dependent labeling.
In this work, we present a single-channel speech enhancement technique called the Modified Phase Opponency (MPO) model as a solution to the Speech Separation Challenge. The MPO model is based on a neural model for detection of tones-in-noise called the Phase Opponency (PO) model. Replacing the noisy speech signals by the corresponding MPO-processed signals increases the accuracy by 31% when the speech signals are corrupted by speech-shaped noise at 0 dB Signal-to-Noise Ratio (SNR). It is worth mentioning that the MPO enhancement scheme was developed using the noisy connected-digit Aurora database and was not tailored in any way to fit the Grid database used in this challenge. One of the salient features of the MPO-based speech enhancement scheme is that it does not need to estimate the noise characteristics, nor does it assume that the noise satisfies any statistical model.
In this paper, an improved preprocessor for low-bit-rate speech coding employing the perceptual weighting filter is proposed. Speech modification in the proposed approach is performed according to a criterion that trades off the modification error against the perceptually weighted quantization error. For this, the perceptual weighting filter is expressed in terms of a transform-domain matrix. A number of listening tests show that the proposed approach is effective in enhancing the speech signal at the coder-decoder (CODEC) output.
Conventional dynamic codebook reordering can often significantly enhance the achievable compression efficiency in simple one-stage vector quantization. When applied to more advanced quantizer structures, however, such as multi-stage vector quantizers, the technique performs worse. This paper describes in detail an enhanced approach to dynamic codebook reordering that improves performance by taking the whole quantizer structure into account. The significant efficiency improvements provided by the proposed approach are demonstrated in practical experiments. Though the results presented in this paper relate to a speech storage system, the proposed approach can also be employed more widely in compression applications that keep the encoder and the decoder in synchrony.
This paper proposes a novel segment-based speech coding algorithm to efficiently compress the database for concatenative text-to-speech (TTS) systems. To achieve a high compression ratio and meet the fundamental requirements of concatenative TTS synthesizers, i.e. partial segment decoding and random access capability, we adopt a modified analysis-by-synthesis scheme. The spectral coefficients are quantized by a length-based interpolation method, and excitation signals are modeled with both non-predictive and predictive approaches. Considering that the pitch pulse waveforms of a specific speaker show low intra-variation, the conventional adaptive codebook for pitch prediction is replaced by a speaker-dependent pitch-pulse codebook. By applying the proposed algorithm to a hand-held Korean TTS system, we verify that the proposed coder provides a compression ratio of about 1/13, a low complexity of around 1.2 WMOPS, and random access capability.
We propose a unified framework for segment quantization of speech at ultra-low bit rates of 150 bits/sec, based on the unit-selection principle and using a modified one-pass dynamic programming algorithm. The algorithm handles both fixed- and variable-length units in a unified manner, thereby providing a generalization over two existing unit selection methods, which deal with single-frame and segmental units differently. The proposed algorithm performs unit-selection based quantization directly on the units of a continuous codebook, thereby not incurring any of the sub-optimalities of the existing segmental algorithm. Moreover, the existing single-frame algorithm becomes a special case of the proposed algorithm. Based on the rate-distortion performance on a multi-speaker database, we show that fixed-length units of 6-8 frames perform significantly better than single-frame units and offer similar spectral distortions as variable-length phonetic units, thereby circumventing expensive segmentation and labeling of a continuous database for unit selection based low bit-rate coding.
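A much-simplified sketch of fixed-length unit-selection quantization (not the paper's modified one-pass DP, which also handles variable-length units and a continuous codebook): one codebook unit is chosen per chunk while penalizing spectral discontinuities at unit joins. All data layouts below are assumptions for illustration:

```python
import numpy as np

def unit_select_fixed(frames, units, join_weight=0.5):
    """Choose one fixed-length codebook unit per chunk via dynamic programming.

    frames: (T, D) spectral parameter trajectory, with T a multiple of L for simplicity.
    units:  (M, L, D) codebook of fixed-length units.
    Chunk cost = quantization distortion + join_weight * discontinuity to the previous unit.
    Returns the chosen unit index for each chunk.
    """
    M, L, D = units.shape
    chunks = frames.reshape(-1, L, D)                                         # (C, L, D)
    dist = np.sum((chunks[:, None] - units[None]) ** 2, axis=(2, 3))          # (C, M)
    # discontinuity between the last frame of unit i and the first frame of unit j
    join = np.sum((units[:, -1, None, :] - units[None, :, 0, :]) ** 2, axis=-1)  # (M, M)
    best = dist[0].copy()
    back = []
    for c in range(1, len(chunks)):
        total = best[:, None] + join_weight * join        # (previous, current)
        back.append(np.argmin(total, axis=0))
        best = dist[c] + np.min(total, axis=0)
    path = [int(np.argmin(best))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```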
This paper describes new efficient Vector Quantization (VQ) techniques that enable low complexity implementations of VQ-based Noise Feedback Coding (NFC). These methods offer mathematical equivalence to higher complexity methods. Furthermore, the paper presents efficient codec structures for general noise shaping as used in BroadVoice 16 - a new SCTE (Society of Cable Telecommunications Engineers) and PacketCable speech coding standard for Cable Telephony.
Comfort noise insertion during speech pauses has been applied to Voice-over-IP and wireless networks to increase bandwidth efficiency. We present two classified comfort noise generation (CCNG) schemes using Gaussian mixture classifiers (GMM-C). Our first scheme employs a classified prototype background noise codebook, with the prototype noise waveform chosen using a GMM-C. The second scheme utilizes a classified enhanced excitation codebook. The new CCNG algorithms provide better comfort noise during speech pauses and a smaller misclassification rate. We have retrofitted the schemes into existing speech transmission systems, such as ITU-T G.711/Appendix II and G.723.1/Annex A. The perceived quality of voice conversations with the novel system is noticeably enhanced for car and babble noise. For the G.711 system, a large improvement is obtained for car noise, while the largest improvement in the G.723.1 system is for babble noise.
This paper describes the CELP coding module within the Adaptive Rate-Distortion Optimized sound codeR (ARDOR). The ARDOR codec combines coding techniques of different natures using a rate-distortion control mechanism, and is able to adapt to a large range of signal characteristics and system constraints. The implemented CELP codec is derived from the 3GPP AMR-WB codec. Adaptations were necessary to match the ARDOR structure constraints, and several new features have been added to improve the codec performance in this context. Listening test results are given to illustrate the behavior of the final codec compared to state-of-the-art coders.
We investigate the use of a two-stage transform vector quantizer (TSTVQ) for coding of line spectral frequency (LSF) parameters in wideband speech coding. The first-stage quantizer of TSTVQ provides better matching of the source distribution, and the second-stage quantizer provides additional coding gain by using an individual cluster-specific decorrelating transform and variance normalization. Further coding gain is shown to be achieved by exploiting the slowly time-varying nature of speech spectra, using an inter-frame cluster continuity (ICC) property in the first stage of the TSTVQ method. The proposed method saves 3-4 bits and reduces the computational complexity by 58-66% compared to the traditional split vector quantizer (SVQ), but at the expense of 1.5-2.5 times the memory.
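A minimal encoder sketch of the two-stage idea, assuming per-cluster transforms and variances have already been trained (illustrative data structures only, not the paper's implementation):

```python
import numpy as np

def tstvq_encode(lsf, stage1_codebook, cluster_transforms, cluster_stds, stage2_codebook):
    """Two-stage transform VQ encoding of one LSF vector (sketch).

    stage1_codebook:    (K, D)    cluster centroids for the first stage.
    cluster_transforms: (K, D, D) per-cluster decorrelating transforms (e.g. KLT rows).
    cluster_stds:       (K, D)    per-cluster standard deviations for variance normalization.
    stage2_codebook:    (M, D)    codebook for the transformed, normalized residual.
    Returns the pair of transmitted indices (k, m).
    """
    # Stage 1: nearest cluster centroid
    k = int(np.argmin(np.sum((stage1_codebook - lsf) ** 2, axis=1)))
    residual = lsf - stage1_codebook[k]
    # Stage 2: decorrelate and normalize the residual, then quantize it
    z = (cluster_transforms[k] @ residual) / cluster_stds[k]
    m = int(np.argmin(np.sum((stage2_codebook - z) ** 2, axis=1)))
    return k, m
```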
Further improvement in performance, to achieve near-transparent-quality LSF quantization, is shown to be possible by using higher-order two-dimensional (2-D) prediction in the coefficient domain. The prediction is performed in a closed-loop manner so that the LSF reconstruction error is the same as the quantization error of the prediction residual. We show that an optimum 2-D predictor, exploiting both inter-frame and intra-frame correlations, performs better than existing predictive methods. A computationally efficient split vector quantization technique is used to implement the proposed 2-D prediction based method. We show further improvement in performance by using a weighted Euclidean distance.
We propose a blind speech watermarking algorithm which allows high-rate embedding of digital side information into speech signals. We exploit the fact that the well-known LPC vocoder works very well for unvoiced speech. A voiced/unvoiced segmentation is carried out using an autocorrelation-based pitch tracking algorithm. In the unvoiced segments, the linear prediction residual is replaced by a data sequence. This substitution does not cause perceptual degradation as long as the residual's power is matched. The signal is resynthesised using the unmodified LPC filter coefficients. The watermark is decoded by a linear prediction analysis of the received signal, and the information is extracted from the sign of the residual. The watermark is nearly imperceptible and provides a channel capacity of up to 2000 bit/s in an 8 kHz-sampled speech signal.
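A rough sketch of the embedding and decoding idea, assuming the LPC analysis, voiced/unvoiced segmentation, and framing are done elsewhere; function names and the one-bit-per-sample layout are illustrative assumptions, and `scipy.signal.lfilter` is used only as a generic LPC synthesis filter:

```python
import numpy as np
from scipy.signal import lfilter

def embed_bits_in_unvoiced(residual, bits):
    """Replace an unvoiced LPC residual with a power-matched +/-1 data sequence.

    residual: prediction residual of one unvoiced segment.
    bits:     0/1 payload with at least len(residual) entries (one bit per sample here).
    """
    data = 2.0 * np.asarray(bits[: len(residual)], dtype=float) - 1.0  # bits -> +/-1
    data *= np.sqrt(np.mean(residual ** 2))                            # match residual power
    return data

def resynthesize(excitation, lpc_coeffs):
    """Run the unmodified LPC synthesis filter 1/A(z) over the new excitation.

    lpc_coeffs: [1, a1, ..., ap] coefficients of A(z).
    """
    return lfilter([1.0], lpc_coeffs, excitation)

def decode_bits(reanalyzed_residual):
    """Recover the payload from the sign of the residual after LPC re-analysis."""
    return (reanalyzed_residual > 0).astype(int)
```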
The concealment procedure used by CELP speech decoders to regenerate lost frames introduces an error that propagates into the following frames. Within the context of voice transmission over packet networks, some packets arrive too late to be decoded and must also be concealed. Once they arrive, however, those packets can be used to update the internal state of the decoder, which stops error propagation. Yet, care must be taken to ensure a smooth transition between the concealed frame and the following "updated" frame computed with properly updated internal states. During voiced or quasi-periodic segments, the pitch phase error that is generally introduced by the concealment procedure makes it difficult and detrimental to quality to use the traditional fade-in, fade-out approach. This paper presents a method to handle that pitch phase error. Specifically, the transition is done in such a way that the natural pitch periodicity of the speech signal is not broken.
Data-driven speech enhancement (Fingscheidt and Suhadi [1]) aims at improving speech quality for voice calls in a specific noise environment. The essence of the method is a set of frequency-dependent weighting rules, indexed by a priori and a posteriori SNRs, which are learned from clean speech and background noise training data. The weighting rules must be stored separately for each frequency bin and take up about 400 kBytes of memory, which makes DSP implementations relatively expensive.
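A hypothetical lookup sketch of how such trained, SNR-indexed weighting rules might be applied per frequency bin (the actual table format of [1] is not specified here); it also makes the memory cost plausible, since a separate gain is stored for every combination of frequency bin, a priori SNR index, and a posteriori SNR index:

```python
import numpy as np

def apply_learned_weighting(noisy_spec, apriori_snr_db, aposteriori_snr_db, table, edges):
    """Apply trained, frequency-dependent weighting rules to one spectral frame.

    noisy_spec:         (F,)       noisy spectral magnitudes for one frame.
    apriori_snr_db,
    aposteriori_snr_db: (F,)       per-bin SNR estimates in dB.
    table:              (F, B, B)  learned gains per bin and quantized SNR pair (assumed layout).
    edges:              (B+1,)     SNR bin edges in dB, shared by both SNR axes (assumption).
    """
    i = np.clip(np.digitize(apriori_snr_db, edges) - 1, 0, table.shape[1] - 1)
    j = np.clip(np.digitize(aposteriori_snr_db, edges) - 1, 0, table.shape[2] - 1)
    gains = table[np.arange(table.shape[0]), i, j]
    return gains * noisy_spec
```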
Most noisy speech enhancement methods result in partial suppression and distortion of the speech spectrum. At instances when the local signal-to-noise ratio in a frequency band is very low, speech partials are often obliterated. In this paper, a method for enhancement and restoration of noisy speech based on a harmonic-noise model (HNM) is introduced. An HNM imposes a temporal-spectral structure that may reduce processing artifacts. The restoration process is enhanced through incorporation of a prior HNM of clean speech stored in a pre-trained codebook. The restored speech is an SNR-dependent combination of the de-noised observation and the speech obtained from weighted codebook mapping. The additional improvements in speech quality resulting from the proposed method are evaluated in comparison to conventional and modern speech enhancement systems. The results show that the proposed method improves the quality of noisy speech and restores much of the information lost to noise.
We consider DFT based techniques for single-channel speech enhancement. Specifically, we derive minimum mean-square error estimators of clean speech DFT coefficients based on generalized gamma prior probability density functions. Our estimators contain as special cases the well-known Wiener estimator and the more recently derived estimators based on Laplacian and two-sided gamma priors. Simulation experiments with speech signals degraded by various additive noise sources verify that the estimator based on the two-sided gamma prior is close to optimal amongst all the estimators considered in this paper.
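For reference, the Wiener estimator mentioned as a special case applies the gain G = xi / (1 + xi), where xi is the a priori SNR of a DFT coefficient. A minimal sketch (the generalized-gamma estimators replace this gain with prior-dependent expressions):

```python
import numpy as np

def wiener_gain(apriori_snr):
    """Wiener suppression gain G = xi / (1 + xi) per DFT bin."""
    xi = np.asarray(apriori_snr, dtype=float)
    return xi / (1.0 + xi)

def estimate_clean_dft(noisy_dft, apriori_snr):
    """Estimate clean DFT coefficients by scaling the noisy coefficients with the gain."""
    return wiener_gain(apriori_snr) * noisy_dft
```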
Laptop computers are increasingly being used as recording devices to capture meetings, interviews, and lectures using the laptop's local microphone. In these scenarios, the user frequently also uses the same laptop to take notes. Because of the close proximity of the laptop's microphone to its keyboard, the captured speech signal is significantly corrupted by the impulsive sounds the user's keystrokes generate. In this paper we propose an algorithm to automatically detect and remove keystrokes from a recorded speech signal. The detection and removal stages both operate by exploiting the natural correlations present in speech signals, but do so in different ways. The proposed algorithm is computationally efficient, requires no user-specific training or enrolment, and results in significantly enhanced speech. The proposed keystroke removal algorithm was evaluated through user listening tests and speech recognition experiments on speech recordings made in a realistic environment.
We present a simple yet effective algorithm for noise reduction of speech signals using a lattice LP filter. Based on previous investigations and a theoretical analysis of lattice filter parameter estimation, we introduce an improved parameter estimation algorithm that takes into account the non-stationary nature of speech and the expected noise signals, yielding good suppression of stationary and slowly time-varying noise. The algorithm has zero delay for the speech signal, promoting its application in telephony or hearing aids. No additional or explicit noise estimation algorithm is needed.
We previously presented a single-channel speech enhancement technique called the Modified Phase Opponency (MPO) model. The MPO model is based on a neural model called the Phase Opponency (PO) model. The efficacy of the MPO model was demonstrated on speech signals corrupted by additive white noise. In the present work, we extend the MPO model to perform efficiently on speech signals corrupted by additive colored noise with time varying spectral characteristics and amplitude levels. The MPO enhancement scheme outperforms many of the statistical and signal-theoretic speech enhancement techniques when evaluated using three different objective quality measures on a subset of the Aurora database. The superiority of the MPO speech enhancement scheme in enhancing speech signals when the amplitude level and the spectral characteristics of the background noise are fluctuating is also demonstrated.