Telecommunications at home is changing rapidly. Many people have moved from the traditional PSTN phone to the mobile phone, and for an increasing number of people Voice-over-IP telephony on a PC platform is becoming the primary technology for voice communication. In this tutorial paper we give an overview of some of the current trends and try to characterize the next generation of home telephony, in particular the concept of ambient telephony. We give an overview of the research challenges in the development of ambient telephone systems and introduce some potential solutions and scenarios.
In this paper, we describe a novel audio database recorded in home environments. The database contains continuous sound from morning to evening, regardless of what the subject is doing, and includes some utterances intended to invoke speech recognition. It tells us how often the speech interface is used, how often it is erroneously activated when not addressed, and how people speak when they really want to use speech recognition. The database also features parallel recordings made with microphone arrays, which are expected to improve the performance of speech/non-speech detection and speech recognition under noisy conditions. Preliminary experiments show that the speech/non-speech detection performance of the trigger-initiated activation system is relatively high, but that of the automatic activation system is not satisfactory. Adopting array-based and F0-based detection algorithms produces a slight rise of the precision/recall curve, but more research is necessary to realize a life with ubiquitous speech interfaces for home appliances, in which machines are always listening to you.
In this paper we investigate the problem of identifying and localizing speakers with distant microphone arrays, thus extending the classical speaker diarization task to answer the question "who spoke when and where?".
A method for estimating sound source distance in dynamic auditory ‘scenes’ using binaural data is presented. The technique requires little prior knowledge of the acoustic environment. It consists of feature extraction for two dynamic distance cues, motion parallax and acoustic τ, coupled with an inference framework for distance estimation. Sequential and non-sequential models are evaluated using simulated anechoic and reverberant spaces. Sequential approaches based on particle filtering more than halve the distance estimation error in all conditions relative to the non-sequential models. These results confirm the value of active behaviour and probabilistic reasoning in auditorily-inspired models of distance perception.
We propose a novel packetization and variable bitrate compression scheme for DSR source coding, based on the Group of Pictures concept from video coding. The proposed algorithm simultaneously packetizes and further compresses source coded features using the high interframe correlation of speech, and is compatible with a variety of VQ-based DSR source coders. The algorithm approximates vector quantizers as Markov Chains, and empirically trains the corresponding probability parameters. Feature frames are then compressed as I-frames, P-frames, or B-frames, using Huffman tables. The proposed scheme can perform lossless compression, but is also robust to lossy compression through VQ pruning or frame puncturing. To illustrate its effectiveness, we applied the proposed algorithm to the ETSI DSR source coder. The algorithm provided compression rates of up to 31.60% with negligible recognition accuracy degradation, and rates of up to 71.15% with performance degradation under 1.0%.
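A minimal sketch of the interframe idea, assuming a generic VQ-based front-end rather than the ETSI coder itself: transition probabilities between consecutive quantizer indices are estimated empirically, and each frame is then costed either as an I-frame (flat code over the codebook) or as a P-frame (conditional code given the previous index, which a Huffman table would approximate). The function names and the smoothing choice are illustrative.

    # Hypothetical sketch: estimate Markov-chain transition probabilities over VQ
    # indices from training data and compare, per frame, the ideal code length of
    # an I-frame (flat code) with that of a P-frame (conditional code given the
    # previous index, which a Huffman table would approximate).
    import numpy as np

    def train_transition_probs(index_sequences, codebook_size):
        """Empirically estimate P(index_t | index_{t-1}); Laplace smoothing assumed."""
        counts = np.ones((codebook_size, codebook_size))
        for seq in index_sequences:
            for prev, cur in zip(seq[:-1], seq[1:]):
                counts[prev, cur] += 1
        return counts / counts.sum(axis=1, keepdims=True)

    def frame_cost_bits(probs, prev_idx, cur_idx, codebook_size):
        """Ideal code lengths in bits for coding the current index."""
        i_bits = np.log2(codebook_size)                 # I-frame: flat code
        p_bits = -np.log2(probs[prev_idx, cur_idx])     # P-frame: conditional code
        return i_bits, p_bits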
Channel selection is important for automatic speech recognition because the signal quality of one channel might be significantly better than that of the other channels, so that microphone array or blind source separation techniques might not lead to improvements over the best single microphone. The major challenge, however, is to find the particular channel that leads to the most accurate classification. In this paper we present a novel channel selection method, based on class separability, to improve multi-source far-distance speech-to-text transcription. Compared to other methods such as the signal-to-noise ratio (SNR), class separability measures have the advantage that they evaluate the channel quality on the actual features of the recognition system.
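As a hedged illustration of the idea (the paper's exact separability measure is not specified here), the following sketch scores each channel with a Fisher-style ratio of between-class to within-class scatter computed on the recognizer's feature frames and their class labels, and keeps the channel with the largest score.

    # Hypothetical sketch of class-separability-based channel selection.
    import numpy as np

    def separability(features, labels):
        """features: (frames, dims); labels: per-frame class ids (e.g. HMM states)."""
        overall_mean = features.mean(axis=0)
        between, within = 0.0, 0.0
        for c in np.unique(labels):
            cls = features[labels == c]
            between += len(cls) * np.sum((cls.mean(axis=0) - overall_mean) ** 2)
            within += np.sum((cls - cls.mean(axis=0)) ** 2)
        return between / max(within, 1e-12)

    def select_channel(channel_features, channel_labels):
        """Pick the channel whose features separate the classes best."""
        scores = [separability(f, l) for f, l in zip(channel_features, channel_labels)]
        return int(np.argmax(scores))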
We present privacy-sensitive methods for (1) automatically finding multi-person conversations in spontaneous, situated speech data and (2) segmenting those conversations into speaker turns. The methods protect privacy through a feature set that is rich enough to capture conversational styles and dynamics, but not sufficient for reconstructing intelligible speech. Experimental results show that the conversation finding method outperforms earlier approaches and that the speaker segmentation method is a significant improvement over the only other known privacy-sensitive method for speaker segmentation.
The head orientation of human speakers in a smart-room affects the quality of the signals recorded by far-field microphones, and consequently influences the performance of the technologies that rely on those signals. Additionally, knowing the orientation in these environments can be useful for the development of several advanced multimodal services, for instance in microphone network management. Consequently, head orientation estimation has recently become a research topic of growing interest. In this paper, we propose two different approaches to head orientation estimation on the basis of multi-microphone recordings: first, an approach based on a generalization of the well-known SRP-PHAT speaker localization algorithm, and second, a new approach based on measurements of the ratio between the high-band and low-band speech energies. Promising results are obtained in both cases, with the algorithms based on speaker localization methods generally performing better.
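A small sketch of the second cue, under the assumption that a talker radiates high frequencies more directionally than low ones, so the microphone the speaker faces sees the largest high-to-low band energy ratio. The band edges and framing are illustrative choices, not the paper's parameters.

    # Illustrative sketch of the high/low band energy-ratio cue.
    import numpy as np

    def band_energy_ratio(x, fs, low=(100.0, 1000.0), high=(4000.0, 8000.0)):
        spec = np.abs(np.fft.rfft(x)) ** 2
        freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
        lo = spec[(freqs >= low[0]) & (freqs < low[1])].sum()
        hi = spec[(freqs >= high[0]) & (freqs < high[1])].sum()
        return hi / max(lo, 1e-12)

    def estimate_facing_microphone(mic_signals, fs):
        """Return the index of the microphone with the largest high/low ratio."""
        ratios = [band_energy_ratio(x, fs) for x in mic_signals]
        return int(np.argmax(ratios)), ratios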
In this paper we introduce soft features of variable resolution for robust distributed speech recognition over channels exhibiting packet losses. The underlying rationale is that lost feature vectors can never be reconstructed perfectly, and therefore reconstruction is carried out at a lower resolution than that of the originally sent features. By doing so, enormous reductions in computational effort can be achieved with graceful or even no degradation in word accuracy. In experiments conducted on the Aurora II database we obtained, for example, a factor-of-30 reduction in computation time for the reconstruction of the soft features without any effect on the word error rate. The proposed method is fully compatible with the ETSI DSR standard, as no changes to the front-end processing or the transmission format are involved.
In this paper, we investigate the validity of the common assumption made in Wiener filtering that the clean speech and noise signals are uncorrelated under the short-time analysis typically used for speech enhancement. To this end we have performed speech enhancement experiments in which speech corrupted by additive white Gaussian noise is enhanced by a Wiener filter designed in the time as well as the frequency domain. Results of oracle-style experiments confirm that the inclusion of the additivity assumption in Wiener filtering results in negligible degradation of enhanced speech quality. Informal listening tests show the background noise resulting from time-domain enhancement to be more tolerable than the background noise resulting from the frequency-domain framework.
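For reference, a minimal oracle-style sketch of the frequency-domain case under the uncorrelatedness (additivity) assumption, where the per-bin gain is P_ss/(P_ss + P_nn); the windowing and framing details are illustrative.

    # Minimal oracle-style sketch of a frequency-domain Wiener filter.
    import numpy as np

    def wiener_oracle_frame(noisy, clean, noise):
        """Enhance one analysis frame given oracle access to clean speech and noise."""
        win = np.hanning(len(noisy))
        Y = np.fft.rfft(noisy * win)
        P_ss = np.abs(np.fft.rfft(clean * win)) ** 2
        P_nn = np.abs(np.fft.rfft(noise * win)) ** 2
        gain = P_ss / (P_ss + P_nn + 1e-12)               # uncorrelatedness assumed
        return np.fft.irfft(gain * Y, n=len(noisy))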
Though spectral subtraction has been widely used for speech enhancement, the spectral order β in spectral subtraction is generally fixed to a constant, which limits performance to a certain degree. In this paper, we first analyze the performance of the β-order generalized spectral subtraction in terms of its gain function to highlight its dependence on the value of the spectral order β. Based on the analysis results, and considering the non-uniform effect of real-world noise on the speech signal, we further propose an adaptive β-order generalized spectral subtraction in which the spectral order β is adaptively updated frame by frame in each critical band, following a sigmoid function of the signal-to-noise ratio. Experimental results in various noise conditions illustrate the superiority of the proposed method over traditional spectral subtraction methods.
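The adaptive rule can be illustrated with the following single-frame sketch; the β range, the sigmoid slope and midpoint, and the direction of the SNR-to-β mapping are assumptions for illustration, not the paper's values.

    # Hypothetical sketch: map each critical band's a posteriori SNR through a
    # sigmoid to a spectral order beta, then subtract in the beta-power domain.
    import numpy as np

    def adaptive_beta_subtraction(noisy_mag, noise_mag, band_slices,
                                  beta_min=0.5, beta_max=2.0, slope=0.3, mid=5.0):
        enhanced = np.empty_like(noisy_mag)
        for band in band_slices:                          # critical-band index ranges
            snr_db = 10 * np.log10(np.sum(noisy_mag[band] ** 2)
                                   / max(np.sum(noise_mag[band] ** 2), 1e-12))
            beta = beta_min + (beta_max - beta_min) / (1 + np.exp(slope * (snr_db - mid)))
            diff = noisy_mag[band] ** beta - noise_mag[band] ** beta
            floor = 1e-3 * noise_mag[band] ** beta        # spectral floor against negatives
            enhanced[band] = np.maximum(diff, floor) ** (1.0 / beta)
        return enhanced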
A phoneme-class-based speech enhancement algorithm is proposed that is derived from the family of constrained iterative enhancement schemes. The algorithm is a Rover-based solution that overcomes three limitations of the iterative scheme: it removes the dependency on the terminating iteration, employs direct phoneme class constraints, and achieves suppression of audible noise. In the Rover scheme, the degraded utterance is partitioned into segments based on class, and class-specific constraints are applied to each segment using a hard decision method. To alleviate the effect of hard decision errors, a GMM-based maximum likelihood (ML) soft decision method is also introduced. Performance evaluation is done using Itakura-Saito, segSNR, and PESQ metrics for four noise types at two SNRs. It is shown that the proposed algorithm outperforms baseline algorithms such as Auto-LSP and log-MMSE for all noise types and levels, and achieves a greater degree of consistency in improving quality for most phoneme classes.
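The soft-decision step can be illustrated as follows: per-class GMM likelihoods are turned into frame-level class posteriors (equal priors assumed) that weight class-specific enhancement outputs, instead of committing each frame to a single class. The GMMs and per-class enhancers are placeholders supplied by the caller (e.g. fitted sklearn GaussianMixture models), not the paper's components.

    # Hypothetical sketch of a GMM-based soft-decision combination.
    import numpy as np

    def soft_decision_enhance(frame_feat, frame_spec, class_gmms, class_enhancers):
        """class_gmms: dict name -> model exposing score_samples();
        class_enhancers: dict name -> function(spectrum) -> enhanced spectrum."""
        names = list(class_gmms.keys())
        log_likes = np.array([class_gmms[n].score_samples(frame_feat[None, :])[0]
                              for n in names])
        post = np.exp(log_likes - log_likes.max())
        post /= post.sum()                                # posterior class weights
        return sum(w * class_enhancers[n](frame_spec) for w, n in zip(post, names))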
This paper introduces a novel speech enhancement method based on Empirical Mode Decomposition (EMD) and soft-thresholding algorithms. A modified soft-thresholding strategy is applied to the intrinsic mode functions (IMFs) of the noisy speech. Due to the characteristics of EMD, each IMF of the noisy signal has a different distribution of noise and speech energy, and thus a different noise variance. Based on this IMF-specific noise variance, applying the proposed thresholding algorithm to each IMF separately makes it possible to effectively remove the noise components. The experimental results suggest that the proposed method is significantly more effective in removing noise components from noisy speech than recently reported techniques. The significantly better SNR improvement and speech quality demonstrate the superiority of the proposed algorithm.
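A minimal sketch of IMF-wise soft thresholding, assuming the IMFs have already been obtained from an EMD routine; the MAD-based noise estimate and the universal threshold used here are common defaults, not necessarily the paper's exact rule.

    # Minimal sketch: threshold each IMF with its own noise estimate, then sum.
    import numpy as np

    def soft_threshold(x, thr):
        return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

    def denoise_imfs(imfs):
        """imfs: array of shape (n_imfs, n_samples) from an EMD of the noisy speech."""
        out = []
        for imf in imfs:
            sigma = np.median(np.abs(imf)) / 0.6745           # robust noise estimate
            thr = sigma * np.sqrt(2.0 * np.log(len(imf)))     # universal threshold
            out.append(soft_threshold(imf, thr))
        return np.sum(out, axis=0)                            # reconstruct enhanced signal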
In this paper we present a low-complexity version of the perceptually constrained signal subspace (PCSS) method for speech enhancement. An approximate solution is presented in a new form which provides perceptually optimal residual noise shaping. The proposed approach does not require a whitening transformation and is sub-optimal for coloured noise. A comparative evaluation of selected methods is performed using objective speech quality measures and informal listening tests. The results show that the approximate method outperforms the conventional one and gives results comparable to the exact solution in common situations.
Quality assessment of speech enhancement systems is a nontrivial task, especially when (residual) noise and echo signal components occur. We present a signal separation scheme that allows for a detailed analysis of unknown speech enhancement systems in a black-box test scenario. Our approach separates the speech, (residual) noise, and (residual) echo components of the speech enhancement system in the sending direction (uplink direction). This makes it possible to judge the speech degradation and the noise and echo attenuation/degradation independently. While state-of-the-art tests always try to judge the sending-direction signal mixture, our new scheme allows a more reliable analysis in shorter time. It will be very useful for testing hands-free devices in practice as well as for testing speech enhancement algorithms in research and development.
Speech enhancement techniques using spectral subtraction have the drawback of generating an annoying musical noise. We develop a new post-processing method for reducing it in each critical band. In the proposed technique, the difference between the tonality coefficients of the noisy speech and the denoised speech constitutes the first detection step. Next, using a modified Johnston masking threshold, we detect the so-called "critical-band musical noise". The reduction is then done simply by keeping the power spectral density of the detected musical noise below the masking thresholds. Simulation results using different criteria are presented to validate the proposed ideas and to show that the enhanced speech is characterized by low distortion and inaudible musical noise.
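As an illustration of the detection cue, the following sketch computes Johnston's tonality coefficient from the spectral flatness measure of one critical band for the noisy and the denoised signal; a large rise in tonality after denoising flags likely musical noise. The -60 dB reference follows Johnston's formulation, while the decision margin is an assumed value.

    # Illustrative tonality-based detection cue for one critical band.
    import numpy as np

    def tonality(band_power):
        geo = np.exp(np.mean(np.log(band_power + 1e-12)))
        arith = np.mean(band_power) + 1e-12
        sfm_db = 10.0 * np.log10(geo / arith)                 # <= 0 dB
        return min(sfm_db / -60.0, 1.0)                       # 0 = noise-like, 1 = tone-like

    def musical_noise_flag(noisy_band_power, denoised_band_power, margin=0.3):
        return (tonality(denoised_band_power) - tonality(noisy_band_power)) > margin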
When noise reduction (NR) and dynamic compression (CP) systems are concatenated in a hearing aid or in a cochlear implant, we observe undesired interaction effects such as a degradation of the global SNR. A reason for this might be that the optimization of the NR algorithm is performed with respect to the uncompressed clean speech only. In this contribution we propose an alternative approach which integrates the CP task into the derivation of the NR algorithm. In this way we obtain novel MMSE- and MAP-optimal estimators for the compressed clean speech. An analysis of the behavior of the proposed solutions reveals that the differences from a serial concatenation of NR and CP are in general small. In the case of the widely used MMSE log spectral amplitude (LSA) estimator [1] we show that the combined optimization is identical to a serial concatenation.
Most DFT-domain speech enhancement methods depend on an estimate of the noise power spectral density (PSD). For non-stationary noise sources it is desirable to estimate the noise PSD also in spectral regions where speech is present. In this paper a new method for noise tracking is presented, based on eigenvalue decompositions of correlation matrices that are constructed from time series of noisy DFT coefficients. The presented method can estimate the noise PSD at time-frequency points where both speech and noise are present. In comparison to state-of-the-art noise tracking algorithms, the proposed algorithm reduces the estimation error between the estimated and the true noise PSD, and improves segmental SNR by several dB when combined with an enhancement system.
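A hedged single-bin sketch of the idea: recent noisy DFT coefficients of one frequency bin are stacked into lag vectors, their sample correlation matrix is eigendecomposed, and the smallest eigenvalues, which are least affected by speech, provide a noise-power estimate. The vector dimension and the averaging rule are assumptions.

    # Hypothetical single-bin sketch of eigenvalue-based noise PSD estimation.
    import numpy as np

    def noise_psd_estimate(dft_bin_series, dim=8):
        """dft_bin_series: complex DFT coefficients of one bin over recent frames."""
        vecs = np.array([dft_bin_series[i:i + dim]
                         for i in range(len(dft_bin_series) - dim + 1)])
        corr = (vecs.conj().T @ vecs) / len(vecs)             # sample correlation matrix
        eigvals = np.linalg.eigvalsh(corr)                    # real, ascending order
        return float(np.mean(eigvals[: dim // 2]))            # smallest eigenvalues ~ noise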
In the design of speech systems, the primary focus is the speech-oriented task, with secondary emphasis on sustaining performance under varying operating conditions. Variation in environmental conditions is one of the most important factors that impact speech system performance. In this study, we propose a framework for noise tracking. The proposed noise tracking algorithm is compared with Martin's [1] and Cohen's [2] estimation schemes for speech enhancement in non-stationary noise conditions. The noise tracking scheme is evaluated over a corpus of three noise types: Babble (BAB), Large Crowd (LCR), and Machine Gun (MGN). The noise modeling scheme for tracking results in a measurable level of improvement for all noise types (e.g., a 13.7% average relative improvement in the Itakura-Saito (IS) measure over 9 noise conditions). This framework is therefore useful for speech applications requiring effective performance in non-stationary environments.
This paper presents a multi-reference noise reduction system for use in a vehicle. It is aimed at enhancing the driver's speech in noisy driving environments for applications such as voice commands or hands-free phone use. First, our system objective is briefly presented, as well as the problems that classical techniques, such as beamforming, may face in an environment as harsh as a vehicle cabin. Second, a brief analysis of noises aboard a road vehicle is given. Third, the noise reduction architecture, comprising a linear and a non-linear block with two respective non-acoustic noise references, is presented. Finally, some results obtained in real driving conditions are presented and analysed. Both human listening tests and speech recognition tests show that our system increases global performance compared to what is obtained with classical single-channel speech processing methods.
For separating multiple speech signals given a convolutive mixture, the time-frequency sparseness of the speech sources can be exploited. In this paper we present a multi-channel source separation method based on the concept of approximate disjoint orthogonality of speech signals. Unlike the binary masking of single-channel signals applied, e.g., in the DUET algorithm, we use a likelihood mask to control the adaptation of blind principal eigenvector beamformers. Furthermore, orthogonal projection of the adapted beamformer filters leads to mutually orthogonal filter coefficients, thus enhancing the demixing performance. Experimental results in terms of the achievable signal-to-interference ratio (SIR) and a perceptual speech quality measure are given for the proposed method and are compared to the DUET algorithm.
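One possible reading of the likelihood mask, sketched below under simplifying assumptions: each time-frequency bin of a two-microphone STFT pair is scored by how well its interchannel phase difference matches a candidate source delay, and the resulting soft score (rather than a hard 0/1 mask) could weight the beamformer adaptation. The Gaussian scoring width is illustrative, and this is not the paper's exact mask.

    # Hypothetical soft (likelihood-style) time-frequency mask from phase differences.
    import numpy as np

    def soft_mask(X1, X2, freqs, candidate_delay, sigma=0.5):
        """X1, X2: STFTs (freq x frames); freqs in cycles/sample; delay in samples."""
        observed = np.angle(X2 * np.conj(X1))
        expected = 2.0 * np.pi * freqs[:, None] * candidate_delay
        err = np.angle(np.exp(1j * (observed - expected)))    # wrap to (-pi, pi]
        return np.exp(-0.5 * (err / sigma) ** 2)               # soft score in (0, 1]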
In this paper, a prototype of a novel algorithm for blind separation of convolutive mixtures of audio sources is proposed. The method works in the time domain and is based on the recently very successful EFICA algorithm for Independent Component Analysis, an enhanced version of the better-known FastICA. The performance of the new algorithm is very promising and at least comparable to other (mostly frequency-domain) algorithms. Audio separation examples are included.
Prior knowledge of familiar auditory patterns is essential for separating sound sources in human auditory processing. Speech recognition modeling is one probabilistic way of capturing these familiar auditory patterns. In this paper we focus on separating speech sources using a single-microphone input only. A model-based algorithm is proposed to generate the target speech by estimating its spectral envelope trajectory and filtering out the irrelevant harmonic structure of the interference. The spectral trajectory is optimally regenerated in the form of line spectrum pair (LSP) parameters. Experiments on separating mixed speech sources are presented. Objective evaluation shows that interference is significantly reduced and the output speech is highly intelligible and sounds fairly clear.
A speech signal captured by a distant microphone is generally contaminated by reverberation and background noise, which severely degrade automatic speech recognition (ASR) performance. In this paper, we first extend a previously proposed single-channel dereverberation algorithm to a multi-channel scenario. The method estimates late reflections using multi-channel multi-step linear prediction and then suppresses them in the power spectral domain. Second, we analyze the effect of additive noise on the proposed method and provide a solution for the noisy reverberant environment. Experimental results show that the proposed method achieves good dereverberation in noisy reverberant environments and can significantly improve ASR performance towards that obtained in a non-reverberant environment.
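A single-channel sketch of the multi-step (delayed) linear prediction idea: the current sample is predicted from samples at least a fixed delay in the past, the predicted part is treated as late reverberation, and its power spectrum is subtracted from the observed one. The delay, prediction order, and subtraction floor are illustrative, and the multi-channel extension is not shown.

    # Single-channel sketch of delayed linear prediction plus power-domain subtraction.
    import numpy as np

    def delayed_lp_coeffs(x, delay, order):
        """Least-squares fit of x[n] from x[n-delay], ..., x[n-delay-order+1]."""
        rows = [x[n - delay - order + 1: n - delay + 1][::-1]
                for n in range(delay + order - 1, len(x))]
        A = np.array(rows)
        b = x[delay + order - 1:]
        coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
        return coeffs

    def suppress_late_reverb(noisy_power_spec, late_power_spec, floor=0.01):
        """Power-spectral-domain subtraction of the estimated late reflections."""
        return np.maximum(noisy_power_spec - late_power_spec, floor * noisy_power_spec)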
In this paper, we propose a novel residual echo suppression (RES) algorithm constructed in the acoustic echo canceller. In the proposed approach, we introduce a statistical model to detect the signal components of the output signal, and the state of the signal is classified into four distinct hypotheses depending on the activity of the near-end signal and the residual echo. For hypothesis testing, the conventional likelihood ratio test is performed to make an optimal decision. The parameters, specified in terms of power spectral densities, are updated according to the hypothesis testing results, and the optimal RES filter is obtained from the estimated parameters. The experimental results show that the proposed algorithm yields improved performance compared to the previous RES technique.
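A simplified per-bin sketch of the four-hypothesis decision, assuming Gaussian spectral models: one log-likelihood ratio tests for near-end activity and another for residual-echo activity, and their two binary outcomes give the four hypotheses. The PSD inputs and thresholds are placeholders, and the joint modelling is simplified relative to the paper.

    # Hypothetical per-bin four-hypothesis classification via two likelihood ratios.
    import numpy as np

    def log_lr(power, psd_h0, psd_h1):
        """Gaussian log-likelihood ratio (H1 over H0) for one bin's observed power."""
        return np.log(psd_h0 / psd_h1) + power / psd_h0 - power / psd_h1

    def classify_bin(obs_power, psd_noise, psd_near, psd_echo, thr=0.0):
        near = log_lr(obs_power, psd_noise, psd_noise + psd_near) > thr
        echo = log_lr(obs_power, psd_noise, psd_noise + psd_echo) > thr
        return {(False, False): "noise only",
                (True, False): "near-end only",
                (False, True): "residual echo only",
                (True, True): "near-end and residual echo"}[(bool(near), bool(echo))]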