We have previously applied a deep autoencoder (DAE) for noise reduction and speech enhancement. However, the DAE was trained using only clean speech. In this study, by using noisy-clean training pairs, we further introduce a denoising process into the learning of the DAE. In training the DAE, we still adopt a greedy layer-wise pretraining plus fine-tuning strategy. In pretraining, each layer is trained as a one-hidden-layer neural autoencoder (AE) using noisy-clean speech pairs as input and output (or noisy-clean speech pairs transformed by the preceding AEs). Fine tuning is done by stacking all AEs, with the pretrained parameters used for initialization. The trained DAE is used as a filter for speech estimation when noisy speech is given. Speech enhancement experiments were carried out to examine the performance of the trained denoising DAE. Noise reduction, speech distortion, and perceptual evaluation of speech quality (PESQ) criteria are used in the performance evaluations. Experimental results show that increasing the depth of the DAE consistently increases performance when a large training data set is given. In addition, compared with a minimum mean square error based speech enhancement algorithm, the proposed denoising DAE provided superior performance on all three objective evaluations.
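As an illustration of the training recipe described above, the following is a minimal sketch (not the authors' code) of greedy layer-wise pretraining on noisy-clean pairs followed by fine-tuning of the stacked network, written in PyTorch; the feature dimension, layer sizes, learning rates, and epoch counts are illustrative assumptions.

```python
# Minimal sketch of greedy layer-wise pretraining of a denoising deep autoencoder
# on noisy-clean feature pairs, followed by fine-tuning of the stacked network.
import torch
import torch.nn as nn

def pretrain_layer(in_dim, hid_dim, x_noisy, y_clean, epochs=50, lr=1e-3):
    """Train a one-hidden-layer AE mapping (transformed) noisy input to the clean target."""
    enc = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
    dec = nn.Linear(hid_dim, y_clean.shape[1])
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(x_noisy)), y_clean)
        loss.backward()
        opt.step()
    return enc

# toy data: 257-dimensional spectral feature frames (random placeholders)
x_noisy = torch.randn(1024, 257)
y_clean = torch.randn(1024, 257)

# greedy pretraining: each new AE sees the previous encoder's output of both streams
dims, encoders, h_noisy, h_clean = [257, 512, 512], [], x_noisy, y_clean
for in_dim, hid_dim in zip(dims[:-1], dims[1:]):
    enc = pretrain_layer(in_dim, hid_dim, h_noisy, h_clean)
    encoders.append(enc)
    with torch.no_grad():
        h_noisy, h_clean = enc(h_noisy), enc(h_clean)

# stack the pretrained encoders plus an output layer and fine-tune end-to-end
dae = nn.Sequential(*encoders, nn.Linear(dims[-1], 257))
opt = torch.optim.Adam(dae.parameters(), lr=1e-4)
for _ in range(100):
    opt.zero_grad()
    nn.functional.mse_loss(dae(x_noisy), y_clean).backward()
    opt.step()
```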
In this study, we perform a theoretical analysis of the amount of musical noise generated by Bayesian minimum mean-square error speech amplitude estimators. In our previous study, a kurtosis-based musical noise assessment was successfully applied to spectral subtraction. However, it is difficult to apply this approach to methods with a decision-directed a priori SNR estimator because that estimator corresponds to a nonlinear recursive process on the noise power spectral sequence. Therefore, in this paper, we analyze musical noise generation by combining Breithaupt and Martin's approximation with our higher-order-statistics analysis. We also compare the result of the theoretical analysis with that of an objective experimental evaluation to demonstrate the validity of the proposed closed-form analysis.
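For reference, the two quantities at the centre of this analysis are the decision-directed a priori SNR estimator and the kurtosis-based musical-noise measure; the standard forms commonly used in this line of work are sketched below (this is background, not the paper's closed-form result).

```latex
% Decision-directed a priori SNR estimate (Ephraim--Malah form) for bin k, frame l:
\[
  \hat{\xi}_k(l) \;=\; \alpha\,\frac{\hat{A}_k^{2}(l-1)}{\lambda_d(k,\,l-1)}
  \;+\;(1-\alpha)\,\max\bigl\{\gamma_k(l)-1,\;0\bigr\},
  \qquad 0 \le \alpha < 1,
\]
% where \hat{A}_k is the previous amplitude estimate, \lambda_d(k,l) the noise power,
% and \gamma_k(l) the a posteriori SNR; the recursion makes the processed noise a
% nonlinear function of past frames, which is what complicates the kurtosis analysis.
%
% Kurtosis-ratio measure of musical noise (processed vs. original noise power spectra):
\[
  \mathrm{KR} \;=\;
  \frac{\operatorname{kurt}\{|\hat{N}(k,l)|^{2}\}}{\operatorname{kurt}\{|N(k,l)|^{2}\}},
  \qquad
  \operatorname{kurt}\{x\} = \frac{\mathbb{E}[x^{4}]}{\mathbb{E}[x^{2}]^{2}} .
\]
```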
This paper investigates a non-negative matrix factorization (NMF)-based approach to the semi-supervised single-channel speech enhancement problem, in which only non-stationary additive noise signals are given. The proposed method relies on a sinusoidal model of speech production, which is integrated into the NMF framework using linear constraints on the dictionary atoms. The method is further developed to regularize harmonic amplitudes, and simple multiplicative algorithms are presented. The experimental evaluation was carried out on the TIMIT corpus mixed with various types of noise. The results show that the proposed method outperforms some state-of-the-art noise suppression techniques in terms of signal-to-noise ratio.
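For readers unfamiliar with the multiplicative-update machinery the abstract refers to, the following is a generic semi-supervised NMF sketch with a fixed pretrained noise dictionary and a Wiener-like speech mask; the paper's sinusoidal linear constraints and harmonic-amplitude regularization are omitted, and all sizes are toy assumptions.

```python
# Generic multiplicative-update NMF sketch (Euclidean cost) for semi-supervised
# enhancement: the noise dictionary is learned from noise-only data and held fixed,
# while the speech dictionary and both activations are updated on the mixture.
import numpy as np

def nmf_mu(V, W, H, fixed_W_cols=0, n_iter=200, eps=1e-12):
    """Multiplicative updates for V ~= W H; the last `fixed_W_cols` columns of W stay frozen."""
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W_new = W * ((V @ H.T) / (W @ H @ H.T + eps))
        if fixed_W_cols:
            W_new[:, -fixed_W_cols:] = W[:, -fixed_W_cols:]  # keep noise atoms fixed
        W = W_new
    return W, H

rng = np.random.default_rng(0)
F, T, Ks, Kn = 257, 400, 40, 20
V = np.abs(rng.standard_normal((F, T)))        # mixture magnitude spectrogram (toy)
W_noise = np.abs(rng.standard_normal((F, Kn))) # pretrained noise dictionary (toy)
W = np.hstack([np.abs(rng.standard_normal((F, Ks))), W_noise])
H = np.abs(rng.standard_normal((Ks + Kn, T)))

W, H = nmf_mu(V, W, H, fixed_W_cols=Kn)
S_hat = (W[:, :Ks] @ H[:Ks]) / (W @ H + 1e-12) * V  # Wiener-like speech estimate
```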
In this paper, we consider the single-channel speech enhancement problem, in which a clean speech signal must be estimated from a noisy observation. To capture the characteristics of both the noise and the speech signal, we combine the well-known Short-Time Spectrum Amplitude (STSA) estimator with a machine-learning-based technique called Multi-frame Sparse Dictionary Learning (MSDL). The former exploits statistical information for denoising, while the latter helps better preserve speech, especially its temporal structure. The proposed algorithm, named STSA-MSDL, outperforms standard statistical algorithms such as the Wiener filter and the STSA estimator, as well as dictionary-based algorithms, on the TIMIT database, according to four objective metrics that measure speech intelligibility, speech distortion, background noise reduction, and overall quality.
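For context, the statistical component of the proposed combination is the classical Ephraim-Malah short-time spectral amplitude estimator, whose standard gain function is recalled below; the MSDL part and the way the two are combined are not shown.

```latex
% Classical STSA gain applied to each noisy spectral amplitude, with a priori SNR \xi
% and a posteriori SNR \gamma (I_0, I_1 are modified Bessel functions of the first kind):
\[
  G_{\mathrm{STSA}}(\xi,\gamma) \;=\;
  \frac{\sqrt{\pi}}{2}\,\frac{\sqrt{\nu}}{\gamma}\,
  \exp\!\left(-\frac{\nu}{2}\right)
  \left[(1+\nu)\, I_0\!\left(\frac{\nu}{2}\right) + \nu\, I_1\!\left(\frac{\nu}{2}\right)\right],
  \qquad
  \nu = \frac{\xi}{1+\xi}\,\gamma .
\]
```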
A novel method for speech enhancement based on Convolutive Non-negative Matrix Factorization (CNMF) is presented in this paper. The sparsity of the activation matrix for speech components has already been utilized in NMF-based enhancement methods. However, such methods do not usually take into account prior knowledge about occurrence relations between different speech components. By introducing the notion of cosparsity, we demonstrate how such relations can be characterized from available speech data and enforced when recovering speech from noisy mixtures. Through objective evaluations we show that the proposed regularization improves sparse reconstruction of speech, especially in low-SNR conditions.
Stochastic-deterministic (SD) speech modelling exploits the predictability of speech components that may be regarded as deterministic. It has recently been employed in speech enhancement, resulting in improved recovery of deterministic speech components, although the improvement achieved depends largely on how these components are estimated. In this paper we propose a joint SD Wiener filtering scheme that exploits the predictability of sinusoidal components in speech. Estimation of the sinusoidal speech components is approached in a recursive Bayesian context, where the linearity of the joint SD Wiener filter and Gaussian assumptions suggest a Kalman filtering scheme for estimating the sinusoidal components. A further refinement imposes a smooth-spectral-envelope restriction on the sinusoidal magnitude estimates. The resulting joint SD Wiener filtering scheme improves speech quality in terms of the perceptual evaluation of speech quality (PESQ) metric compared with both the traditional Wiener filter and the proposed Wiener filter based on alternative estimates of the deterministic speech components.
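The Kalman filtering step suggested by the linear-Gaussian formulation can be illustrated with the following minimal sketch, which tracks slowly varying sinusoidal parameters from noisy frame-wise observations under an assumed random-walk state model; it is not the paper's exact filter.

```python
# Minimal linear Kalman filter sketch for tracking sinusoidal component parameters
# from noisy frame-wise observations. The random-walk state model and the noise
# levels q and r are illustrative assumptions.
import numpy as np

def kalman_track(observations, q=1e-3, r=1e-1):
    """Track a state vector under x_t = x_{t-1} + w_t and y_t = x_t + v_t."""
    n = observations.shape[1]
    x = observations[0].copy()           # initial state from the first observation
    P = np.eye(n)                        # initial state covariance
    Q, R = q * np.eye(n), r * np.eye(n)  # process / observation noise covariances
    estimates = [x.copy()]
    for y in observations[1:]:
        P = P + Q                        # predict (random-walk transition)
        K = P @ np.linalg.inv(P + R)     # Kalman gain (observation matrix H = I)
        x = x + K @ (y - x)              # update with the new observation
        P = (np.eye(n) - K) @ P
        estimates.append(x.copy())
    return np.array(estimates)

# toy usage: noisy per-frame log-magnitudes of 8 sinusoidal components
rng = np.random.default_rng(0)
truth = np.cumsum(0.02 * rng.standard_normal((200, 8)), axis=0)
obs = truth + 0.3 * rng.standard_normal(truth.shape)
smoothed = kalman_track(obs)
```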
In this work, we introduce a new discriminative training method for nonnegative dictionary learning that can be used in single-channel source separation (SCSS) applications. In SCSS, nonnegative matrix factorization (NMF) is used to learn a dictionary (a set of basis vectors) for each source in the magnitude spectrum domain. The trained dictionaries are then used to decompose the mixed signal and find an estimate for each source. Learning discriminative dictionaries for the source signals can improve separation performance. To obtain discriminative dictionaries, we try to prevent the basis set of one source's dictionary from representing the other source signals. We propose to minimize the cross-coherence between the dictionaries of all sources in the mixed signal, incorporating a simplified cross-coherence penalty into a regularized NMF cost function so as to learn dictionaries that are simultaneously discriminative and reconstructive. The new regularized NMF update rules used to discriminatively train the dictionaries are introduced in this work. Experimental results show that discriminative training gives better separation results than conventional NMF.
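One plausible way to realize a cross-coherence regularization is sketched below: a penalty proportional to the squared Frobenius norm of W1^T W2 is added to the Euclidean NMF cost, and its gradient enters the denominators of the multiplicative dictionary updates. This is an illustrative formulation, not necessarily the paper's exact update rules.

```python
# Hedged sketch of discriminative NMF dictionary training with a cross-coherence
# penalty lambda * ||W1^T W2||_F^2 added to the Euclidean NMF cost, so that the two
# source dictionaries avoid representing each other's training data.
import numpy as np

def discriminative_nmf(V1, V2, K, lam=0.1, n_iter=200, eps=1e-12):
    rng = np.random.default_rng(0)
    F = V1.shape[0]
    W1 = np.abs(rng.standard_normal((F, K)))
    W2 = np.abs(rng.standard_normal((F, K)))
    H1 = np.abs(rng.standard_normal((K, V1.shape[1])))
    H2 = np.abs(rng.standard_normal((K, V2.shape[1])))
    for _ in range(n_iter):
        H1 *= (W1.T @ V1) / (W1.T @ W1 @ H1 + eps)
        H2 *= (W2.T @ V2) / (W2.T @ W2 @ H2 + eps)
        # cross-coherence gradient terms enter the denominators of the dictionary updates
        W1 *= (V1 @ H1.T) / (W1 @ H1 @ H1.T + lam * (W2 @ W2.T @ W1) + eps)
        W2 *= (V2 @ H2.T) / (W2 @ H2 @ H2.T + lam * (W1 @ W1.T @ W2) + eps)
    return W1, W2

# toy training spectrograms for two sources
rng = np.random.default_rng(1)
V1 = np.abs(rng.standard_normal((257, 300)))
V2 = np.abs(rng.standard_normal((257, 300)))
W1, W2 = discriminative_nmf(V1, V2, K=30)
```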
We propose a novel method of pitch track correction that uses an ensemble Kalman filter to improve the performance of monaural speech segregation. The proposed method considers all reliable pitch streaks for pitch track correction, whereas the conventional segregation approach relies only on the longest streak in a given speech stream. In addition, unreliable pitch streaks are corrected with an ensemble Kalman filter that uses autocorrelation functions as noisy observations of the hidden true pitch values. The proposed approach provides more accurate pitch estimation, thus improving speech segregation performance for various types of noise, in particular colored noise. In speech segregation experiments on mixtures of speech and various competing noises, the proposed method demonstrated superior performance to the conventional approach.
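A minimal ensemble Kalman filter for smoothing a scalar pitch track is sketched below; the random-walk pitch dynamics, noise levels, and ensemble size are illustrative assumptions rather than the paper's configuration.

```python
# Illustrative ensemble Kalman filter (EnKF) sketch for smoothing a scalar pitch
# track from noisy per-frame pitch observations (e.g. picked from autocorrelation peaks).
import numpy as np

def enkf_pitch(observations, n_ens=50, q=4.0, r=100.0, seed=0):
    rng = np.random.default_rng(seed)
    ens = observations[0] + np.sqrt(r) * rng.standard_normal(n_ens)  # initial ensemble (Hz)
    track = [ens.mean()]
    for y in observations[1:]:
        # forecast: random-walk pitch dynamics with process noise variance q
        ens = ens + np.sqrt(q) * rng.standard_normal(n_ens)
        # analysis: gain from the ensemble variance, update with perturbed observations
        P = ens.var(ddof=1)
        K = P / (P + r)
        y_pert = y + np.sqrt(r) * rng.standard_normal(n_ens)
        ens = ens + K * (y_pert - ens)
        track.append(ens.mean())
    return np.array(track)

# toy usage: a slowly gliding 120 Hz pitch contour corrupted by observation noise
t = np.arange(150)
true_f0 = 120 + 10 * np.sin(2 * np.pi * t / 150)
obs = true_f0 + 8 * np.random.default_rng(1).standard_normal(150)
smoothed = enkf_pitch(obs)
```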
A simple, low-computational-complexity system for bi-speaker speech separation is proposed in this paper. The system consists of a voice activity classification (VAC) module and an adaptive bi-beamformer module that separates speech using a microphone array. The first module identifies the active speaker(s) and allows the system to control the adaptation of the second module automatically. The VAC is based on a novel two-step classification method. The first step uses a robust VAC method based on our previous work on the beamformer-output ratio of a bi-beamforming system. The second step refines the VAC results using a novel method derived from an analytical result on the output power of an adaptive beamformer. The system is tested in reverberant environments with both synthesized and real recordings. The synthesized recordings contain two speakers, background speech, and noise; the real recording contains two speakers speaking spontaneously. The VAC results satisfy a conservative classification scheme that avoids the signal cancellation problem. The final separation outputs are compared with the ideal outputs provided by genie-aided adaptive beamformers that have perfect VAC knowledge. The results show that the proposed automatic system achieves performance close to that of the ideal system.
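The first VAC step can be illustrated with the simplified beamformer-output-ratio classifier below; the thresholds and decision logic are assumptions made for illustration, and the analytical second-step refinement is not reproduced.

```python
# Simplified sketch of a beamformer-output-ratio voice activity classifier: each
# beamformer output is steered at one speaker, and the frame-wise output power
# ratio decides who is active (a conservative "both" label freezes adaptation).
import numpy as np

def vac_from_bor(y1, y2, frame=512, hop=256, ratio_db=6.0, floor_db=-40.0):
    """Return one label per frame: 'spk1', 'spk2', 'both', or 'silence'."""
    labels = []
    for start in range(0, min(len(y1), len(y2)) - frame, hop):
        p1 = 10 * np.log10(np.mean(y1[start:start + frame] ** 2) + 1e-12)
        p2 = 10 * np.log10(np.mean(y2[start:start + frame] ** 2) + 1e-12)
        if max(p1, p2) < floor_db:
            labels.append("silence")
        elif p1 - p2 > ratio_db:
            labels.append("spk1")
        elif p2 - p1 > ratio_db:
            labels.append("spk2")
        else:
            labels.append("both")  # ambiguous frame: do not adapt either beamformer
    return labels

# toy usage with two random beamformer output signals
rng = np.random.default_rng(0)
labels = vac_from_bor(rng.standard_normal(16000), 0.1 * rng.standard_normal(16000))
```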
Distributed microphone array (DMA) processing has recently been attracting increasing research interest due to its various applications and diverse challenges. Many conventional multi-channel speech enhancement algorithms that use co-located microphones, such as multi-channel Wiener filtering and mask-based blind source separation (BSS), require statistics of the target and interference signals in order to design an optimal enhancement filter. To obtain such statistics, we estimate activity information for the source and interference signals (hereafter, source activity information), which is generally assumed to be common to all the microphones. In DMA scenarios, however, the source activities observable at any given microphone may differ significantly from those at the others when the microphones are widely distributed, and the level of each signal at each microphone varies considerably. Thus, to capture such source activity information appropriately and thereby achieve optimal speech enhancement in DMA environments, in this paper we propose an approach for estimating microphone-dependent source activity and for performing blind source separation based on this information. The proposed method estimates the activity of each source signal at each microphone, which is explained by microphone-independent speech log power spectra and microphone-location-dependent source gains. We introduce a probabilistic formulation of the proposed method and an efficient algorithm for model parameter estimation. We experimentally show the efficacy of the proposed method in comparison with conventional methods in various DMA scenarios.
This paper proposes an algorithm for separating monaural audio signals by non-negative tensor factorisation of modulation spectrograms. The modulation spectrogram represents redundant patterns across frequency with similar features, and the tensor factorisation is able to isolate these patterns in an unsupervised way. The method overcomes the inability of conventional non-negative matrix factorisation algorithms to utilise the redundancy of sounds across frequency. In the proposed method, separated sounds are synthesised by filtering the mixture signal with a Wiener-like filter generated from the estimated tensor factors. The proposed method was compared to conventional algorithms in unsupervised separation of mixtures of speech and music. Improved signal-to-distortion ratios were obtained compared to standard non-negative matrix factorisation and non-negative matrix deconvolution.
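A minimal sketch of unsupervised non-negative CP (tensor) factorisation with multiplicative updates, followed by a Wiener-like mask built from a subset of components, is given below; the modulation-spectrogram front-end and the resynthesis step are omitted, and the tensor is a toy array.

```python
# Minimal non-negative CP factorisation with multiplicative updates (Euclidean cost),
# plus a Wiener-like mask built from the components assigned to one source.
import numpy as np

def ntf_cp(X, R, n_iter=100, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = np.abs(rng.standard_normal((I, R)))
    B = np.abs(rng.standard_normal((J, R)))
    C = np.abs(rng.standard_normal((K, R)))
    for _ in range(n_iter):
        Xh = np.einsum('ir,jr,kr->ijk', A, B, C)
        A *= np.einsum('ijk,jr,kr->ir', X, B, C) / (np.einsum('ijk,jr,kr->ir', Xh, B, C) + eps)
        Xh = np.einsum('ir,jr,kr->ijk', A, B, C)
        B *= np.einsum('ijk,ir,kr->jr', X, A, C) / (np.einsum('ijk,ir,kr->jr', Xh, A, C) + eps)
        Xh = np.einsum('ir,jr,kr->ijk', A, B, C)
        C *= np.einsum('ijk,ir,jr->kr', X, A, B) / (np.einsum('ijk,ir,jr->kr', Xh, A, B) + eps)
    return A, B, C

# toy non-negative tensor with an assumed (modulation-frequency, frequency, time) layout
X = np.abs(np.random.default_rng(1).standard_normal((20, 64, 100)))
A, B, C = ntf_cp(X, R=8)

# Wiener-like mask for the components assigned to one source (here, components 0-3)
Xh_all = np.einsum('ir,jr,kr->ijk', A, B, C)
Xh_src = np.einsum('ir,jr,kr->ijk', A[:, :4], B[:, :4], C[:, :4])
mask = Xh_src / (Xh_all + 1e-12)
```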
Partial phase reconstruction based on a confidence domain has recently been shown to improve signal reconstruction performance in a single-channel source separation scenario. In this paper, we replace the previously used binarized fixed-threshold confidence domain with a new signal-dependent one, estimated by applying a sinusoidal model to the estimated magnitude spectra of the underlying sources in the mixture. We also extend the sinusoidal-based confidence domain to the Multiple Input Spectrogram Inversion (MISI) framework and propose to redistribute the remixing error at each iteration over the sinusoidal signal components. Our experiments on both oracle and estimated spectra show that the proposed method achieves improved separation results with fewer iterations, making it a favorable choice for faster phase estimation.
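The basic MISI loop that the proposed method builds on can be sketched as follows: the time-domain remixing error is redistributed equally across sources at each iteration. The sinusoidal confidence domain and the proposed error weighting are not included in this sketch.

```python
# Sketch of the basic MISI (Multiple Input Spectrogram Inversion) loop: given the
# mixture signal and estimated source magnitude spectrograms, phases are refined by
# redistributing the time-domain remixing error equally across sources.
import numpy as np
from scipy.signal import stft, istft

def misi(mix, mags, n_fft=1024, n_iter=30):
    """mags: list of estimated magnitude spectrograms, same shape as stft(mix)."""
    _, _, X = stft(mix, nperseg=n_fft)
    phases = [np.angle(X)] * len(mags)                 # initialise with the mixture phase
    for _ in range(n_iter):
        sigs = [istft(m * np.exp(1j * p), nperseg=n_fft)[1][:len(mix)]
                for m, p in zip(mags, phases)]
        err = (mix - np.sum(sigs, axis=0)) / len(sigs)  # equal error redistribution
        phases = [np.angle(stft(s + err, nperseg=n_fft)[2]) for s in sigs]
    return [istft(m * np.exp(1j * p), nperseg=n_fft)[1][:len(mix)]
            for m, p in zip(mags, phases)]

# toy usage with two random "sources"
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(16000), rng.standard_normal(16000)
mix = s1 + s2
mags = [np.abs(stft(s, nperseg=1024)[2]) for s in (s1, s2)]
estimates = misi(mix, mags)
```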
The present paper addresses multiparty telephone conferences with asymmetric quality degradations. We propose a systematic method for investigating how individual technical degradations lead to the perception of quality impairments by the different interlocutors in a conference call. By conducting this analysis for a number of degradations, we draw a detailed picture of the complexity of assessing asymmetric conditions, which in turn confirms the need for such strictly systematic assessment approaches.
We propose a method for estimating user activities by analyzing long-term (more than several seconds) acoustic signals represented as temporal sequences of acoustic events. The proposed method is based on a probabilistic generative model of an acoustic event temporal sequence that is associated with user activities (e.g. "cooking") and with subordinate categories of those activities (e.g. "fry ingredients" or "plate food"). In this model, each user activity is represented as a probability distribution over unsupervised subordinate categories, called activity-topics, and each activity-topic is represented as a probability distribution over acoustic events. The model can therefore express user activities that have more than one subordinate category, which a model considering only user activities cannot express adequately. User activity estimation with this model is a two-step process: frame-by-frame acoustic event estimation to output an acoustic event temporal sequence, followed by user activity estimation with the proposed probabilistic generative model. Activity estimation experiments with real-life sounds indicated that the proposed method improves the accuracy and stability of user activity estimation on "unseen" acoustic event temporal sequences. In addition, the experiments showed that the proposed method can extract correct subordinate categories of user activities.
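The activity / activity-topic / acoustic-event hierarchy can be illustrated with the toy generative sketch below; all counts, priors, and labels are assumptions made purely for illustration and do not reproduce the paper's inference procedure.

```python
# Illustrative generative sketch of the hierarchy described above: each activity
# defines a distribution over latent activity-topics, and each activity-topic
# defines a distribution over acoustic events.
import numpy as np

rng = np.random.default_rng(0)
n_activities, n_topics, n_events, seq_len = 3, 5, 12, 40

theta = rng.dirichlet(np.ones(n_topics), size=n_activities)  # P(topic | activity)
phi = rng.dirichlet(np.ones(n_events), size=n_topics)        # P(event | topic)

def generate_sequence(activity):
    """Sample an acoustic event temporal sequence for one activity."""
    topics = rng.choice(n_topics, size=seq_len, p=theta[activity])
    return np.array([rng.choice(n_events, p=phi[z]) for z in topics])

seq = generate_sequence(activity=1)  # e.g. one "cooking" recording as event indices
```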
In previous work, we proposed a model for speech-to-speech translation that is sensitive to paralinguistic information such as the duration and power of spoken words. This model uses linear regression to map source acoustic features to target acoustic features directly and in continuous space. However, while the model is effective, it faces scalability issues, as a separate model must be trained for every word, which makes it difficult to generalize to words for which we have no parallel speech. In this work we first demonstrate that simply training a linear regression model on all words is not sufficient to express paralinguistic translation. We then describe a neural network model that has sufficient expressive power to perform paralinguistic translation with a single model. We evaluate the proposed method on a digit translation task and show that a single neural network-based model achieves results similar to those obtained in previous work with word-dependent models.
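A minimal sketch of the single-model idea, i.e., one neural regressor mapping source-side acoustic features to target-side acoustic features instead of per-word linear regressors, is given below; the feature dimensions, architecture, and training details are illustrative assumptions.

```python
# Minimal sketch of a single neural regression model mapping source-side acoustic
# features (e.g. per-word duration and power descriptors) to target-side features.
import torch
import torch.nn as nn

src_dim, tgt_dim = 16, 8  # assumed per-word feature dimensions
model = nn.Sequential(
    nn.Linear(src_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, tgt_dim),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# toy parallel data: paired source/target acoustic feature vectors across many words
X = torch.randn(2000, src_dim)
Y = torch.randn(2000, tgt_dim)

for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), Y)
    loss.backward()
    opt.step()

pred = model(torch.randn(1, src_dim))  # predicted target duration/power features
```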
Speech translation (ST) systems consist of three major components: automatic speech recognition (ASR), machine translation (MT), and speech synthesis (SS). In general, the ASR system is tuned independently to minimize word error rate (WER), but previous research has shown that ASR and MT can be jointly optimized to improve translation quality. Independently, many techniques have recently been proposed for the optimization of MT, such as minimum error rate training (MERT), pairwise ranking optimization (PRO), and the batch margin infused relaxed algorithm (MIRA). The first contribution of this paper is an empirical comparison of these techniques in the context of joint optimization. As the last two methods can use sparse features, we also introduce lexicalized features based on the frequencies of recognized words. In addition, motivated by initial results, we propose a hybrid optimization method that changes the translation evaluation measure depending on the features to be optimized. Experimental results for the best combination of algorithm and features show a gain of 1.3 BLEU points at 27% of the computational cost of previous joint optimization methods.
This paper proposes a new method for automatically detecting disfluencies in spontaneous speech, specifically self-corrections, that explicitly models repetitions versus other disfluencies. We show that, in a corpus of Supreme Court oral arguments, repetition disfluencies can be longer and more stutter-like than the short repetitions observed in the Switchboard corpus, and we suggest that they can be better represented with a flat structure that covers the full sequence. Since these disfluencies are relatively easy to detect, weakly supervised training is an effective way to minimize labeling costs. By explicitly modeling repetitions, we improve general disfluency detection within and across domains, and we provide a richer transcript.
This paper focuses on the identification of disfluent sequences and their distinct structural regions, based on acoustic and prosodic features. The reported experiments are based on a corpus of university lectures in European Portuguese comprising roughly 32 hours of speech with a relatively high percentage of disfluencies (7.6%). The set of features automatically extracted from the corpus proved to be discriminative of the regions involved in the production of a disfluency. Several machine learning methods were applied, but the best results were achieved using Classification and Regression Trees (CART). The features most informative for cross-region identification encompass word duration ratios, word confidence scores, silence ratios, and pitch and energy slopes. Features such as the number of phones and syllables per word proved more useful for identifying the interregnum, whereas energy slopes were best suited for identifying the interruption point.
Comparing human speech recognition (HSR) and machine performance allows us to learn from the differences between HSR and automatic speech recognition (ASR) and motivates the use of auditory-inspired strategies in ASR. The recognition of noisy digit strings from the Aurora 2 framework is one of the most widely used tasks in the ASR community. This paper establishes a baseline with a close-to-optimal classifier, i.e., our auditory system, by comparing results from 10 normal-hearing listeners with those of the Aurora 2 reference system on identical speech material. The baseline ASR system reaches the human performance level only when the signal-to-noise ratio is increased by 10 or 21 dB, depending on the training condition. The recognition of one-digit recordings was found to be considerably better for HSR, indicating that onset detection is an important feature neglected in standard ASR systems. Results of recent studies are considered in the light of these findings to measure how far we have come on the way to human speech recognition performance.
Retrieving information from the ever-increasing amount of unannotated audio and video recordings requires techniques such as unsupervised pattern discovery and query-by-example. In this paper we focus on queries specified as an audio snippet containing the desired word or expression excised from the target recordings. The task is to retrieve all and only the instances whose match score with the query meets an absolute criterion. For this purpose we introduce a distance measure between two acoustic vectors that can be calibrated in a completely unsupervised manner. This measure also enables a fast matching approach, which makes it possible to skip more than 97% of the full DTW computations without affecting performance in terms of precision and recall. We demonstrate the effectiveness of these proposals with query-by-example experiments conducted on a read speech corpus for English and a spontaneous speech corpus for Dutch.
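For concreteness, a plain DTW matcher between a query and a candidate window is sketched below; the paper's unsupervised distance calibration and the lower-bound-based fast-matching stage that skips most full DTW computations are not reproduced here, and the cosine frame distance is an assumption.

```python
# Plain DTW matching sketch between a query and a search-window feature sequence,
# using cosine distance between frames.
import numpy as np

def frame_dist(a, b, eps=1e-12):
    """Cosine distance between two feature frames."""
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def dtw_score(query, window):
    """Normalised DTW alignment cost between the query and a candidate window."""
    Q, W = len(query), len(window)
    D = np.full((Q + 1, W + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, W + 1):
            c = frame_dist(query[i - 1], window[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Q, W] / (Q + W)

# toy usage: 13-dimensional MFCC-like frames
rng = np.random.default_rng(0)
query = rng.standard_normal((40, 13))
window = rng.standard_normal((60, 13))
score = dtw_score(query, window)  # accept the hit if the score is below a calibrated threshold
```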
The amount of multimedia data is increasing every day, and there is a growing demand for high-accuracy multimedia retrieval systems that go beyond retrieving simple events (e.g., detecting a sport video) to more specific and harder-to-detect events (e.g., a point in a tennis match). To retrieve such complex events, audio content features play an important role, since they provide information complementary to image/video features. In this paper, we propose a novel approach in which we employ an HMM-based acoustic concept recognition (ACR) system and convert the resulting recognition lattices into acoustic concept indexes that represent the multimedia audio content. The lattice indexes are created by extracting posterior-weighted N-gram counts from the ACR lattices, and they are used as features in SVM-based classification for the multimedia event detection (MED) task. We evaluate the proposed approach on the NIST 2011 TRECVID MED development set, which consists of user-generated videos from the Internet. The proposed approach yields an Equal Error Rate (EER) of 31.6% on this acoustically challenging dataset (on a set of five video events), outperforming previously proposed supervised and unsupervised approaches on the same dataset (34.5% and 36.9%, respectively).
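The indexing-and-classification idea can be illustrated with the simplified sketch below, which uses 1-best concept sequences and unweighted N-gram counts with an SVM rather than posterior-weighted counts extracted from recognition lattices; the concept labels and data are toy assumptions.

```python
# Simplified sketch: acoustic-concept token sequences are turned into N-gram count
# features and fed to an SVM for event detection.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# toy 1-best acoustic-concept transcripts (space-separated concept labels)
docs = [
    "crowd cheer ball_bounce racket_hit crowd",
    "music speech laughter music applause",
    "racket_hit ball_bounce crowd cheer umpire_call",
    "speech music speech traffic",
]
labels = [1, 0, 1, 0]  # 1 = target event (e.g. a tennis point), 0 = other

vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r"\S+")
X = vectorizer.fit_transform(docs)

clf = LinearSVC()
clf.fit(X, labels)
score = clf.decision_function(vectorizer.transform(["cheer ball_bounce racket_hit"]))
```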
This paper presents a collection of tools (and adaptors for existing tools) that we have recently developed to support the acquisition, annotation and analysis of multimodal corpora. For acquisition, an extensible architecture is offered that integrates various sensors, based on existing connectors (e.g. for motion capture via VICON or ART) and on connectors we contribute (for motion tracking via Microsoft Kinect as well as eye tracking via Seeingmachines FaceLAB 5). The architecture provides live visualisation of the multimodal data in a unified virtual reality (VR) view (using Fraunhofer Instant Reality) for control during recordings, and enables recording of synchronised streams. For annotation, we provide a connection between the annotation tool ELAN (MPI Nijmegen) and the VR visualisation. For analysis, we provide routines in the programming language Python that read in and manipulate (aggregate, transform, plot, analyse) the sensor data, as well as text annotation formats (Praat TextGrids). Use of this toolset in multimodal studies has proved efficient and effective, as we discuss. We make the collection available as open source for use by other researchers.
We propose an alternative evaluation metric to Word Error Rate (WER) for the decision audit task on meeting recordings, which exemplifies how to evaluate speech recognition within a legitimate application context. Using machine learning on an initial seed of human-subject experimental data, our alternative metric handily outperforms WER, which correlates very poorly with human subjects' success in finding decisions given ASR transcripts with a range of WERs.
Real-time speech-to-speech (S2S) translation of lectures and speeches requires simultaneous translation with low latency to keep listeners continually engaged. However, simultaneous speech-to-speech translation systems have predominantly repurposed translation models that are typically trained for consecutive translation, without a motivated attempt to model incrementality. Furthermore, the notion of interpretation is simplified to translation plus simultaneity. In contrast, human interpreters perform simultaneous interpretation by generating target speech incrementally with a very low ear-voice span, using a variety of strategies such as compression (paraphrasing), incremental comprehension, and anticipation through discourse inference and expectation of discourse redundancies. Exploiting and modeling such phenomena can potentially improve automatic real-time translation of speech. As a first step, in this work we identify and present a systematic analysis of the phenomena used by human interpreters to perform simultaneous interpretation and discuss how they can be exploited in a conventional simultaneous translation framework. We perform our study on a corpus of simultaneous interpretation of parliamentary speeches in English and Spanish. Specifically, we present an empirical analysis of factors such as time constraints, redundancy, and inference as evidenced in the simultaneous interpretation corpus.
We present a real-time automatic speech translation system for university lectures that can interpret several lectures in parallel. University lectures are characterized by a multitude of diverse topics and a large number of technical terms. This poses specific challenges: a highly specialized vocabulary and language model are needed. In addition, to be able to translate simultaneously, i.e., to interpret the lectures, the components of the system need special modifications. The output of the system is delivered in the form of real-time subtitles via a website that students attending the lecture can access through mobile phones, tablet computers, or laptops. We evaluated the system on our German-to-English lecture translation task at the Karlsruhe Institute of Technology. The system is now being installed in several lecture halls at KIT and is able to provide translations to the students in several parallel sessions.