The elastic spatial filter (ESF) proposed in recent years is a popular multi-channel speech enhancement front end based on deep neural networks (DNNs). It is suitable for real-time processing and has shown promising automatic speech recognition (ASR) results. However, the ESF only utilizes the knowledge of fixed beamforming, resulting in limited noise reduction capabilities. In this paper, we propose a DNN-based generalized sidelobe canceller (GSC) that automatically tracks the target speaker's direction in real time and uses a blocking technique to generate reference noise signals, which further reduce the noise in the fixed beam pointing to the target direction. The coefficients in the proposed GSC are fully learnable, and an ASR criterion is used to optimize the entire network. Four-channel experiments show that the proposed GSC achieves a relative word error rate improvement of 27.0% compared to the raw observation, 20.6% compared to the oracle direction-based traditional GSC, 10.5% compared to the ESF, and 7.9% compared to the oracle mask-based generalized eigenvalue (GEV) beamformer.
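As a point of reference, below is a minimal numpy sketch of the classical GSC structure that the learnable version builds on (fixed beamformer, blocking matrix, adaptive noise canceller), not the paper's DNN implementation. The NLMS update, step size, and variable names are illustrative assumptions.

```python
import numpy as np

def gsc_frame(x, w_f, B, w_a, mu=0.01):
    """One STFT frame of a classical GSC at a single frequency bin.

    x   : (M,) complex observation vector (M microphones)
    w_f : (M,) fixed beamformer weights steered at the target direction
    B   : (M, M-1) blocking matrix whose columns are orthogonal to the
          target steering vector, producing M-1 noise reference signals
    w_a : (M-1,) adaptive noise-canceller weights (updated here by NLMS)
    """
    y_fbf = np.vdot(w_f, x)              # fixed-beamformer output (w_f^H x)
    u = B.conj().T @ x                   # noise references (target blocked)
    y = y_fbf - np.vdot(w_a, u)          # cancel residual noise from the beam
    # NLMS update of the adaptive weights, using the GSC output as the error
    w_a = w_a + mu * u * np.conj(y) / (np.vdot(u, u).real + 1e-8)
    return y, w_a
```

In the proposed system, both the steering of the fixed beam and the blocking/cancellation coefficients are learned and optimized end-to-end with an ASR criterion rather than adapted by such a closed-form rule.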
Purely neural network (NN) based speech separation and enhancement methods, although they can achieve good objective scores, inevitably cause nonlinear speech distortions that are harmful to automatic speech recognition (ASR). On the other hand, the minimum variance distortionless response (MVDR) beamformer with NN-predicted masks, although it can significantly reduce speech distortions, has limited noise reduction capability. In this paper, we propose a multi-tap MVDR beamformer with complex-valued masks for speech separation and enhancement. Compared to the state-of-the-art NN-mask based MVDR beamformer, the multi-tap MVDR beamformer exploits the inter-frame correlation in addition to the inter-microphone correlation that is already utilized in prior arts. Further improvements include the replacement of the real-valued masks with complex-valued masks and the joint training of the complex-mask NN. The evaluation on our multi-modal multi-channel target speech separation and enhancement platform demonstrates that the proposed multi-tap MVDR beamformer improves both the ASR accuracy and the perceptual speech quality over prior arts.
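For orientation, here is a minimal sketch of the conventional single-tap, mask-based MVDR baseline that the multi-tap variant extends; the multi-tap version would stack the current and a few previous frames into a longer observation vector before forming the covariances, and the masks would be complex-valued. All names and the reference-channel choice are illustrative.

```python
import numpy as np

def mask_mvdr(X, speech_mask, noise_mask, ref_mic=0):
    """Mask-based MVDR for one frequency bin.

    X           : (M, T) complex STFT observations (M mics, T frames)
    speech_mask : (T,) weights used to accumulate the speech covariance
    noise_mask  : (T,) weights used to accumulate the noise covariance
    """
    phi_ss = (speech_mask * X) @ X.conj().T / (speech_mask.sum() + 1e-8)
    phi_nn = (noise_mask * X) @ X.conj().T / (noise_mask.sum() + 1e-8)
    # Souden-style MVDR solution: Phi_nn^{-1} Phi_ss u / trace(Phi_nn^{-1} Phi_ss)
    num = np.linalg.solve(phi_nn, phi_ss)
    w = num[:, ref_mic] / (np.trace(num) + 1e-8)
    return w.conj() @ X                   # (T,) enhanced output, w^H x per frame
```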
This paper proposes an online dual-microphone system for directional speech enhancement, which employs geometrically constrained independent vector analysis (IVA) based on the auxiliary function approach and vectorwise coordinate descent. Its offline version has recently been proposed and shown to outperform the conventional auxiliary-function-approach-based IVA (AuxIVA) thanks to properly designed spatial constraints. We extend the offline algorithm to an online one by incorporating an autoregressive approximation of an auxiliary variable. Experimental evaluations revealed that the proposed online algorithm could run in real time and achieved speech enhancement performance superior to online AuxIVA in both situations, i.e., where a fixed target was interfered with by a spatially stationary interference and where it was interfered with by a dynamic one.
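The following is a minimal sketch of the kind of autoregressive (forgetting-factor) update that turns the batch auxiliary-variable estimate into an online one; the geometric constraints and the vectorwise coordinate descent steps of the paper are omitted, and the contrast-function weight and forgetting factor are illustrative assumptions.

```python
import numpy as np

def online_aux_update(V_prev, x, r, alpha=0.96, eps=1e-8):
    """Recursive update of the weighted spatial covariance for one source.

    V_prev : (M, M) auxiliary variable from the previous frame
    x      : (M,)   current multichannel STFT frame at one frequency bin
    r      : float  current source activity (e.g., norm of the separated signal)
    alpha  : forgetting factor implementing the autoregressive approximation
    """
    phi = 1.0 / (r + eps)                  # weight from the source model / contrast
    return alpha * V_prev + (1 - alpha) * phi * np.outer(x, x.conj())
```

The demixing filters are then re-estimated each frame from the recursively updated auxiliary variables instead of from a full batch of past frames.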
The performance of keyword spotting (KWS), measured in false alarms and false rejects, degrades significantly under far-field and noisy conditions. In this paper, we propose multi-look neural network modeling for speech enhancement, which simultaneously steers the system to listen to multiple sampled look directions. The multi-look enhancement is then jointly trained with KWS to form an end-to-end KWS model that integrates the enhanced signals from multiple look directions and leverages an attention mechanism to dynamically tune the model's attention to the reliable sources. We demonstrate, on our large noisy and far-field evaluation sets, that the proposed approach significantly improves the KWS performance against the baseline KWS system and a recent beamformer-based multi-beam KWS system.
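As a rough illustration of the attention step, the sketch below weights the enhanced features of several look directions with a softmax over learned reliability scores before they are passed to the KWS back end. The feature layout, scoring projection, and pooling are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def attend_over_looks(E, w_query):
    """Combine per-look enhanced features with attention weights.

    E       : (L, T, F) features of the enhanced signal for L look directions
    w_query : (F,)      learned projection scoring the reliability of each look
    """
    scores = E.mean(axis=1) @ w_query          # (L,) one score per look direction
    att = np.exp(scores - scores.max())
    att = att / att.sum()                      # softmax over the look directions
    return np.tensordot(att, E, axes=(0, 0))   # (T, F) attention-weighted mixture
```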
The use of omni-directional microphones is commonly assumed in differential beamforming with uniform circular arrays. Conventional differential beamforming with omni-directional elements tends to suffer from low white noise gain (WNG) at low frequencies and a decreased directivity factor (DF) at high frequencies. WNG measures the robustness of a beamformer, and DF evaluates the array performance in the presence of reverberation. The major contributions of this paper are as follows. First, we extend the existing work by presenting a new approach that uses directional microphone elements, and we show clearly the connection between the conventional beamforming and the proposed beamforming. Second, a comparative study shows that the proposed approach brings about a noticeable improvement in WNG at low frequencies and some improvement in DF at high frequencies by exploiting an additional degree of freedom in the differential beamforming design. In addition, the beampattern appears more frequency-invariant than that of the conventional method. Third, we study how the proposed beamformer performs as the number of microphone elements and the radius of the array vary.
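For reference, the two metrics can be computed from a set of beamformer weights as in the minimal sketch below, assuming a far-field steering vector toward the look direction and a spherically diffuse noise coherence matrix for the DF; constructing those inputs for a specific circular array geometry is left out.

```python
import numpy as np

def wng_and_df(w, d, Gamma_diff):
    """White noise gain and directivity factor of beamformer weights w.

    w          : (M,) beamformer weights at one frequency
    d          : (M,) steering vector toward the look direction
    Gamma_diff : (M, M) spherically diffuse noise coherence matrix
    """
    signal = np.abs(np.vdot(w, d)) ** 2
    wng = signal / np.real(np.vdot(w, w))             # robustness to sensor self-noise
    df = signal / np.real(w.conj() @ Gamma_diff @ w)  # gain against diffuse (reverberant) noise
    return wng, df
```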
This paper investigates trade-offs between the number of model parameters and enhanced speech quality by employing several deep tensor-to-vector regression models for speech enhancement. We find that a hybrid architecture, namely CNN-TT, is capable of maintaining good quality performance with a reduced model parameter size. CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality and a tensor-train (TT) output layer on top to reduce model parameters. We first derive a new upper bound on the generalization power of convolutional neural network (CNN) based vector-to-vector regression models. Then, we provide experimental evidence on the Edinburgh noisy speech corpus to demonstrate that, in single-channel speech enhancement, CNN outperforms DNN at the expense of a small increase in model size. Moreover, CNN-TT slightly outperforms its CNN counterpart while using only 32% of the CNN model parameters, and further performance improvement can be attained if the number of CNN-TT parameters is increased to 44% of the CNN model size. Finally, our multi-channel speech enhancement experiments on a simulated noisy WSJ0 corpus demonstrate that the proposed hybrid CNN-TT architecture achieves better results than both DNN and CNN models in terms of both enhanced speech quality and parameter size.
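To see why a tensor-train output layer shrinks the parameter budget, consider the parameter-count comparison sketched below: a dense output layer stores the full weight matrix, whereas a TT layer stores only a few small cores. The layer shapes and TT ranks are illustrative assumptions, not the paper's configuration.

```python
# Dense output layer mapping a 1024-dim hidden vector to a 257-bin output.
dense_params = 1024 * 257

# TT layer: factor the input dimension as 4*4*8*8 = 1024 and the output as
# roughly 4*4*4*4 = 256, with TT ranks (1, r, r, r, 1).
# Each core k holds ranks[k] * in_modes[k] * out_modes[k] * ranks[k+1] weights.
in_modes, out_modes, r = (4, 4, 8, 8), (4, 4, 4, 4), 4
ranks = (1, r, r, r, 1)
tt_params = sum(ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
                for k in range(4))

print(dense_params, tt_params)  # ~263k dense weights vs. under 1k TT weights
```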
Multi-speaker speech recognition has been one of the key challenges in conversation transcription, as it breaks the single-active-speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered a remedy to this problem. Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is processed by fixed beamformers, followed by neural network post-filtering. Although promising results were obtained, the system contains multiple individually developed modules, potentially leading to sub-optimal performance. In this work, we introduce an end-to-end modeling version of UFE. To enable gradient propagation through the whole pipeline, an attentional selection module is proposed, in which an attentional weight is learned for each beamformer and spatial feature sampled over space. Experimental results show that the proposed system achieves performance comparable to the original separate-processing-based pipeline in an offline evaluation, while producing remarkable improvements in an online evaluation.
Mentoring-reverse mentoring, a novel knowledge transfer framework for unsupervised learning, is introduced for multi-channel speech source separation. This framework aims to improve two different systems, referred to as a senior and a junior system, by having them mentor each other. The senior system, composed of a neural separator and a statistical blind source separation (BSS) model, generates a pseudo-target signal. The junior system, composed of a neural separator and a post-filter, is constructed using teacher-student learning with the pseudo-target signal generated by the senior system, i.e., imitating the output of the senior system (mentoring step). Then, the senior system can be improved by propagating the shared neural separator of the grown-up junior system back to the senior system (reverse mentoring step). Since the improved neural separator can provide better initial parameters for the statistical BSS model, the senior system can yield more accurate pseudo-target signals, leading to iterative improvement of the pseudo-target signal generator and the neural separator. Experimental comparisons conducted under the condition that mixture-clean parallel data are not available demonstrated that the proposed mentoring-reverse mentoring framework improves speech source separation over existing unsupervised source separation methods.
This paper proposes new blind signal processing techniques for optimizing a multi-input multi-output (MIMO) convolutional beamformer (CBF) in a computationally efficient way to simultaneously perform dereverberation and source separation. For effective CBF optimization, a conventional technique factorizes it into a multiple-target weighted prediction error (WPE) based dereverberation filter and a separation matrix. However, this technique requires the calculation of a huge spatio-temporal covariance matrix that reflects the statistics of all the sources, which makes the computational cost very high. For computationally efficient optimization, this paper introduces two techniques: one that decomposes the huge covariance matrix into ones for individual sources, and another that decomposes the CBF into sub-filters for estimating individual sources. Both techniques effectively and substantially reduce the size of the covariance matrices that must be calculated, and allow us to greatly reduce the computational cost without loss of optimality.
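In generic notation, the conventional factorization referred to above can be sketched as follows: the convolutional beamformer acting on the stacked current-plus-delayed observation is split into a WPE-style dereverberation filter followed by an instantaneous separation matrix (symbols here are standard placeholders, not the paper's exact notation).

```latex
% Convolutional beamformer applied to the stacked (current + delayed) observation
\mathbf{y}_t = \mathbf{W}^{\mathsf{H}} \bar{\mathbf{x}}_t,
\qquad
\bar{\mathbf{x}}_t = \big[\mathbf{x}_t^{\mathsf{T}},\,
      \mathbf{x}_{t-D}^{\mathsf{T}},\,\ldots,\,\mathbf{x}_{t-L}^{\mathsf{T}}\big]^{\mathsf{T}}

% Factorized form: WPE-type dereverberation followed by a separation matrix
\mathbf{z}_t = \mathbf{x}_t - \sum_{\tau=D}^{L} \mathbf{G}_\tau^{\mathsf{H}} \mathbf{x}_{t-\tau},
\qquad
\mathbf{y}_t = \mathbf{Q}^{\mathsf{H}} \mathbf{z}_t
```

The proposed techniques reduce the cost of estimating the prediction filters \(\mathbf{G}_\tau\) by working with per-source covariances and per-source sub-filters instead of one joint spatio-temporal covariance for all sources.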
We propose a space-and-speaker-aware iterative mask estimation (SSA-IME) approach to improving complex angular central Gaussian mixture model (cACGMM) based beamforming in an iterative manner by leveraging the complementary information obtained from SSA-based regression. First, a mask calculated from beamformed speech features is proposed to enhance the estimation accuracy of the ideal ratio mask from noisy speech. Second, the outputs of cACGMM-beamformed speech, with given time annotations as initial values, are used to extract log-power spectral and inter-phase difference features of different speakers, which serve as inputs to estimate the regression-based SSA model. Finally, in decoding, the mask estimated by the SSA model is also used to iteratively refine the cACGMM-based masks, yielding enhanced multi-array speech. Tested on the recent CHiME-6 Challenge Track 1 tasks, the proposed SSA-IME framework significantly and consistently outperforms state-of-the-art approaches, achieving the lowest word error rates for both Track 1 speech recognition tasks.
Precisely characterizing the neurophysiological activity involved in natural conversations remains a major challenge. In this paper, we explore the relationship between multimodal conversational behavior and brain activity during natural conversations. This is challenging due to the limited time resolution of functional magnetic resonance imaging (fMRI) and to the diversity of the recorded multimodal signals. We use a unique corpus including localized brain activity and behavior recorded during an fMRI experiment in which several participants had natural conversations alternately with a human and with a conversational robot. The corpus includes fMRI responses as well as conversational signals consisting of synchronized raw audio and transcripts, video, and eye-tracking recordings. The proposed approach includes a first step to extract discrete neurophysiological time series from functionally well-defined brain areas, as well as behavioral time series describing specific behaviors. Machine learning models are then applied to predict the neurophysiological time series from the extracted behavioral features. The results show promising prediction scores, and specific causal relationships are found between behaviors and the activity in functional brain areas for both conditions, i.e., human-human and human-robot conversations.
Reconstruction of the speech envelope from neural signals is a common way to study neural entrainment, which helps in understanding the neural mechanisms underlying speech processing. Previous neural entrainment studies were mainly based on single-trial neural activities, and the reconstruction accuracy of the speech envelope was not high, probably due to interference from diverse noise sources such as breath and heartbeat. Considering that such noise emerges independently of the consistent neural processing shared by subjects responding to the same speech stimulus, we propose a method that aligns and averages the electroencephalography (EEG) signals of the subjects for the same stimuli to reduce the noise in the neural signals. The Pearson correlation of the reconstructed speech envelopes with the original ones showed a large improvement compared to the single-trial-based method: our study raised the correlation coefficient in the delta band from around 0.25, as obtained in previous leading single-trial studies, to 0.5. The speech tracking phenomenon occurred not only in the commonly reported delta and theta bands but also in the gamma band of the EEG. Moreover, the reconstruction accuracy for regular speech was higher than that for time-reversed speech, suggesting that neural entrainment to the natural speech envelope reflects speech semantics.
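The pipeline described above can be sketched roughly as follows: average time-aligned EEG epochs across subjects for the same stimulus, fit a linear (ridge-regularized) backward model over a set of time lags to reconstruct the envelope, and score the reconstruction with Pearson correlation. The lag range and regularization strength are illustrative assumptions, and within-fold cross-validation is omitted for brevity.

```python
import numpy as np

def reconstruct_envelope(eeg_epochs, envelope, lags=range(0, 25), lam=1e3):
    """eeg_epochs: (S, C, T) epochs from S subjects, C channels, T samples,
    already time-aligned to the same stimulus; envelope: (T,) target envelope."""
    eeg = eeg_epochs.mean(axis=0)                       # average over subjects
    # Build a lagged design matrix (T, C * n_lags); circular shift is a
    # minor edge approximation in this sketch.
    X = np.hstack([np.roll(eeg, lag, axis=1).T for lag in lags])
    # Ridge-regularized backward model
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ envelope)
    rec = X @ w
    r = np.corrcoef(rec, envelope)[0, 1]                # Pearson correlation
    return rec, r
```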
Alterations in speech and language are typical signs of mild cognitive impairment (MCI), considered to be the prodromal stage of Alzheimer's disease (AD). Yet, very few studies have pointed out at which stage speech production is disrupted in these patients. To bridge this knowledge gap, the present study focused on lexical retrieval, a specific process during speech production, and investigated how it is affected in cognitively impaired patients using state-of-the-art brain functional network analysis. Seventeen patients with MCI and 20 age-matched controls were invited to complete a primed picture naming task, in which the prime was either semantically related or unrelated to the target. Using electroencephalography (EEG) signals collected during task performance, event-related potentials (ERPs) were analyzed, together with the construction of the brain functional network. Results showed that whereas MCI patients did not exhibit significant differences in reaction time or ERP responses, their brain functional network was altered, accompanied by a significant main effect on accuracy. The observed increases in clustering coefficient and characteristic path length indicated a deterioration in global information processing, providing evidence that deficits in lexical retrieval may occur even at the preclinical stage of AD.
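The two graph metrics reported (clustering coefficient and characteristic path length) can be computed from an EEG functional-connectivity matrix as in the minimal sketch below; the binarization threshold is an illustrative assumption, and how the connectivity matrix itself is estimated is not specified here.

```python
import numpy as np
import networkx as nx

def network_metrics(connectivity, threshold=0.5):
    """connectivity: (N, N) symmetric matrix of pairwise EEG channel coupling."""
    adj = (np.abs(connectivity) > threshold).astype(int)
    np.fill_diagonal(adj, 0)                       # no self-loops
    G = nx.from_numpy_array(adj)
    clustering = nx.average_clustering(G)          # mean local clustering coefficient
    # Characteristic path length is defined on the largest connected component
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    path_length = nx.average_shortest_path_length(giant)
    return clustering, path_length
```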
Listeners usually have the ability to selectively attend to target speech while ignoring competing sounds. A mechanism whereby top-down attention modulates the cortical envelope tracking of speech has been proposed to account for this ability. Additional visual input, such as lipreading, is considered beneficial for speech perception, especially in noise. However, the effect of audiovisual (AV) congruency on the dynamic properties of cortical envelope tracking activities has not been discussed explicitly, and the involvement of the cortical regions processing AV speech is unclear. To address these issues, electroencephalography (EEG) was recorded while participants attended to one talker in a mixture under several AV conditions (audio-only, congruent, and incongruent). Temporal response function (TRF) and inter-trial phase coherence (ITPC) analyses were used to index the cortical envelope tracking for each condition. Compared with the audio-only condition, both indices were enhanced only for the congruent AV condition, and the enhancement was prominent over both the auditory and visual cortex. In addition, the timings of the different cortical regions involved in cortical envelope tracking activities depended on the stimulus modality. The present work provides new insight into the neural mechanisms of auditory selective attention when visual input is available.
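Of the two indices, ITPC has a particularly compact definition: the magnitude of the mean unit phasor across trials at each channel and time point. The sketch below assumes the analytic signal has already been obtained (e.g., by band-pass filtering followed by the Hilbert transform); the input layout is an illustrative assumption.

```python
import numpy as np

def itpc(analytic_signals):
    """analytic_signals: (n_trials, n_channels, n_times) complex values."""
    phasors = analytic_signals / (np.abs(analytic_signals) + 1e-12)  # unit phasors
    # 0 = phases random across trials, 1 = perfectly phase-locked
    return np.abs(phasors.mean(axis=0))    # (n_channels, n_times)
```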
Human listeners can recognize target speech streams in complex auditory scenes. Cortical activity can robustly track the amplitude fluctuations of target speech under auditory attentional modulation over a range of signal-to-masker ratios (SMRs). The root-mean-square (RMS) level of the speech signal is a crucial acoustic cue for target speech perception. However, in most studies, neural-tracking activities were analyzed with the intact speech temporal envelope, ignoring the characteristic decoding features of different RMS-level-specific speech segments. This study aimed to explore the contributions of high- and middle-RMS-level segments to target speech decoding in noisy conditions based on electroencephalogram (EEG) signals. The target stimulus was mixed with a competing speaker at five SMRs (i.e., 6, 3, 0, -3, and -6 dB), and the temporal response function (TRF) was then used to analyze the relationship between neural responses and the high- or middle-RMS-level segments. Experimental results showed that target and ignored speech streams had significantly different TRF responses under conditions with high- or middle-RMS-level segments. Moreover, the high- and middle-RMS-level segments elicited TRF responses with different morphological distributions. These results suggest that distinct models could be used for different RMS-level-specific speech segments to better decode target speech from the corresponding EEG signals.
Human speech processing, whether for listening or oral reading, requires dynamic cortical activities that are not only driven externally by sensory stimuli but also influenced internally by semantic knowledge and speech planning goals. Each of these functions is known to be accompanied by specific rhythmic oscillations and to be localized in distributed networks. The question is how the brain organizes these spatially and spectrally distinct functional networks with the temporal precision that endows us with our remarkable speech abilities. To clarify this, the present study conducted an oral reading task with natural sentences and simultaneously collected the involved brain waves, eye movements, and speech signals with high-density EEG and eye-tracking equipment. By examining the regional oscillatory spectral perturbations and modeling the frequency-specific interregional connections, our results revealed a hierarchical oscillatory mechanism in which gamma oscillation entrained to the fine-structured sensory input while beta oscillation modulated the sensory output. Alpha oscillation mediated between sensory perception and cognitive function via selective suppression, and theta oscillation synchronized local networks for large-scale coordination. Rather than a single function-frequency correspondence, the coexistence of multi-frequency oscillations was found to be critical for local regions to communicate remotely and diversely within a larger network.
In processing behavioral data from auditory lexical decision, reaction times (RTs) can be defined relative to stimulus onset or relative to stimulus offset. Using stimulus onset as the reference invokes models that assume that relevant processing starts immediately, while stimulus offset invokes models that assume that relevant processing can only start when the acoustic input is complete. It has been suggested that EEG recordings can be used to tease apart these putative processes. EEG analysis requires some kind of time-locking of epochs, so that averaging of multiple signals does not mix up effects of different processes. However, in many lexical decision experiments the duration of the speech stimuli varies substantially. Consequently, processes tied to stimulus offset are not appropriately aligned and might get lost in the averaging process. In this paper we investigate whether the time course of putative processes such as phonetic encoding, lexical access and decision making can be derived from ERPs and from instantaneous power representations in several frequency bands when epochs are time-locked at stimulus onset or at stimulus offset. In addition, we investigate whether time-locking at the moment when the response is given can shed light on the decision process per se.
Between 15% and 40% of mild traumatic brain injury (mTBI) patients experience incomplete recoveries or report subjectively decreased motor abilities despite a clinically determined complete recovery. This demonstrates a need for objective measures capable of detecting subclinical residual mTBI, particularly in return-to-duty decisions for warfighters and return-to-play decisions for athletes. In this paper, we utilize features from recordings of directed speech and gait tasks completed by ten healthy controls and eleven subjects with lingering subclinical impairments from an mTBI. We hypothesize that decreased coordination and precision during the fine motor movements governing speech production (articulation, phonation, and respiration), as well as during the gross motor movements governing gait, can be effective indicators of subclinical mTBI. Decreases in coordination are measured from correlations of vocal acoustic feature time series and torso acceleration time series. We apply eigenspectra derived from these correlations to machine learning models to discriminate between the two subject groups. The fusion of correlation features derived from the acoustic and gait time series achieves an AUC of 0.98. This highlights the potential of combining vocal acoustic features from speech tasks with torso acceleration during a simple gait task as a rapid screening tool for subclinical mTBI.
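A rough sketch of correlation-eigenspectrum features of this kind is given below: stack time-delayed copies of a multivariate feature time series (acoustic or torso-acceleration), form the correlation matrix of the stacked series, and keep its eigenvalue spectrum as the input to a classifier. The delay spacing, number of delays, and feature layout are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def correlation_eigenspectrum(features, n_delays=15, delay=1):
    """features: (T, D) time series of D acoustic or accelerometer channels."""
    T, D = features.shape
    # Stack delayed copies: result has shape (T - n_delays*delay, D * n_delays)
    blocks = [features[i * delay: T - (n_delays - i) * delay] for i in range(n_delays)]
    stacked = np.hstack(blocks)
    corr = np.corrcoef(stacked, rowvar=False)           # (D*n_delays, D*n_delays)
    eigvals = np.linalg.eigvalsh(corr)[::-1]            # eigenspectrum, descending
    return eigvals
```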
ICE-Talk is an open-source, web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for controllable TTS. Moreover, it is implemented as a module that can be used as part of a human-agent interaction.
Speech provides an intuitive interface for communicating with machines. Today, developers who want to implement such an interface must either rely on third-party proprietary software or become experts in speech recognition. Conversely, researchers in speech recognition who wish to demonstrate their results need to be familiar with technologies that are not relevant to their research (e.g., graphical user interface libraries). In this demo, we introduce Kaldi-web: an open-source, cross-platform tool which bridges this gap by providing a user interface built around the online decoder of the Kaldi toolkit. Additionally, because we compile Kaldi to WebAssembly, speech recognition is performed directly in the web browser. This addresses privacy issues, as no data is transmitted over the network for speech recognition.
SoapBox Labs' child speech verification platform is a service designed specifically for identifying keywords and phrases in children's speech. Given an audio file containing children's speech and one or more target keywords or phrases, the system returns a recognition confidence score for the word(s) or phrase(s) within the audio file. The confidence scores are provided at the utterance, word and phoneme levels. The service is available online through a cloud API service, or offline on Android and iOS. The platform is accurate for speech from children as young as 3 and is robust to noisy environments. In this demonstration we show how to access the online API and give some examples of common use cases in literacy and language learning, gaming and robotics.
The SoapBox Labs Fluency API service allows the automatic assessment of a child’s reading fluency. The system uses automatic speech recognition (ASR) to transcribe the child’s speech as they read a passage. The ASR output is then compared to the text of the reading passage, and the fluency algorithm returns information about the accuracy of the child’s reading attempt. In this show and tell paper we describe how the fluency cloud API is accessed and demonstrate how the fluency demo system processes an audio file, as shown in the accompanying video.
We present Catotron, a neural network-based open-source speech synthesis system in Catalan. Catotron consists of a sequence-to-sequence model trained with two small open-source datasets based on semi-spontaneous and read speech. We demonstrate how a neural TTS can be built for languages with limited resources using found-data optimization and cross-lingual transfer learning. We make the datasets, initial models and source code publicly available for both commercial and research purposes.
We demonstrate a multimodal conversational platform for remote patient diagnosis and monitoring. The platform engages patients in an interactive dialog session and automatically computes metrics relevant to speech acoustics and articulation, oro-motor and oro-facial movement, cognitive function and respiratory function. The dialog session includes a selection of exercises that have been widely used in both speech language pathology research and clinical practice: an oral motor exam, sustained phonation, diadochokinesis, read speech, spontaneous speech, spirometry, picture description, emotion elicitation and other cognitive tasks. Finally, the system automatically computes speech, video, cognitive and respiratory biomarkers that have been shown to be useful in capturing various aspects of speech motor function and neurological health, and visualizes them in a user-friendly dashboard.
We propose a novel AI framework that performs real-time multi-speaker recognition without any prior registration or pretraining by learning speaker identities on the fly. We consider the practical problem of online learning with episodically revealed rewards and introduce a solution based on semi-supervised and self-supervised learning methods, demonstrated in a web-based application at https://www.baihan.nyc/viz/VoiceID/