This paper investigates the use of time-domain convolutional denoising autoencoders (TCDAEs) with multiple channels as a method of speech enhancement. In general, denoising autoencoders (DAEs), deep learning systems that map noise-corrupted waveforms to clean waveforms, have been shown to generate high-quality signals while working in the time domain, without an intermediate stage of phase modeling. Convolutional DAEs are a popular structure for learning this mapping between noise-corrupted and clean waveforms. Multi-channel signals are promising for TCDAEs because the different times of arrival of a signal can be processed directly by their convolutional structure. Up to this time, however, TCDAEs have only been applied to single-channel signals. This paper explores the effectiveness of TCDAEs in a multi-channel configuration. Multi-channel TCDAEs are evaluated in multi-channel speech enhancement experiments, yielding significant improvement over single-channel DAEs in terms of signal-to-distortion ratio, perceptual evaluation of speech quality (PESQ), and word error rate.
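As a rough illustration of the architecture described above, the following is a minimal PyTorch sketch of a multi-channel time-domain convolutional DAE that maps a multi-channel noisy waveform to a single clean waveform; the layer counts, widths, and kernel size are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a multi-channel time-domain convolutional DAE (TCDAE),
# assuming C input microphone channels and a single clean output waveform.
# Layer counts and widths are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class MultiChannelTCDAE(nn.Module):
    def __init__(self, n_channels=4, hidden=64, kernel=11):
        super().__init__()
        pad = kernel // 2
        self.encoder = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=pad), nn.ReLU(),
        )
        # The decoder maps the learned representation back to one waveform.
        self.decoder = nn.Conv1d(hidden, 1, kernel, padding=pad)

    def forward(self, x):                      # x: (batch, n_channels, n_samples)
        return self.decoder(self.encoder(x))   # (batch, 1, n_samples)

# Training step: regress the noisy multi-channel input onto the clean waveform.
model = MultiChannelTCDAE()
noisy = torch.randn(8, 4, 16000)               # dummy 1-second, 4-channel batch
clean = torch.randn(8, 1, 16000)
loss = nn.functional.mse_loss(model(noisy), clean)
loss.backward()
```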
Using multiple microphones for speech enhancement allows for exploiting spatial information for improved performance. In most cases, the spatial filter is selected to be a linear function of the input as, for example, the minimum variance distortionless response (MVDR) beamformer. For non-Gaussian distributed noise, however, the minimum mean square error (MMSE) optimal spatial filter may be nonlinear. Potentially, such nonlinear functional relationships could be learned by deep neural networks. However, the performance would depend on many parameters and the architecture of the neural network. Therefore, in this paper, we more generally analyze the potential benefit of nonlinear spatial filters as a function of the multivariate kurtosis of the noise distribution. The results imply that using a nonlinear spatial filter is only worth the effort if the noise data follows a distribution with a multivariate kurtosis that is considerably higher than for a Gaussian. In this case, we report a performance difference of up to 2.6 dB segmental signal-to-noise ratio (SNR) improvement for artificial stationary noise. We observe an advantage of 1.2 dB for the nonlinear spatial filter over the linear one even for real-world noise data from the CHiME-3 dataset given oracle data for parameter estimation.
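For reference, the linear spatial filter used as the baseline above is typically the MVDR beamformer, w = R_n^{-1} d / (d^H R_n^{-1} d), applied per frequency bin; a nonlinear spatial filter would replace the linear operation y = w^H x with a learned nonlinear function of x. A minimal NumPy sketch of the linear case, with dummy inputs, is given below.

```python
# Sketch of the linear baseline referenced above: an MVDR beamformer for one
# frequency bin, w = R_n^{-1} d / (d^H R_n^{-1} d).  Inputs are assumptions:
# R_n is the (M x M) noise covariance, d the steering / relative transfer vector.
import numpy as np

def mvdr_weights(R_n, d):
    Rn_inv_d = np.linalg.solve(R_n, d)
    return Rn_inv_d / (d.conj() @ Rn_inv_d)

# Apply per time frame: y = w^H x, with x the M-channel STFT vector of the bin.
M = 6
R_n = np.eye(M, dtype=complex)                 # dummy noise covariance
d = np.ones(M, dtype=complex)                  # dummy steering vector
x = np.random.randn(M) + 1j * np.random.randn(M)
w = mvdr_weights(R_n, d)
y = w.conj() @ x
```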
This paper deals with multi-channel speech recognition in scenarios with multiple speakers. Recently, the spectral characteristics of a target speaker, extracted from an adaptation utterance, have been used to guide a neural network mask estimator to focus on that speaker. In this work we present two variants of speaker-aware neural networks, which exploit both spectral and spatial information to allow better discrimination between target and interfering speakers. Thus, we introduce either a spatial pre-processing prior to the mask estimation or a spatial plus spectral speaker characterization block whose output is directly fed into the neural mask estimator. The target speaker’s spectral and spatial signature is extracted from an adaptation utterance recorded at the beginning of a session. We further adapt the architecture for low-latency processing by means of block-online beamforming that recursively updates the signal statistics. Experimental results show that the additional spatial information clearly improves source extraction, in particular in the same-gender case, and that our proposal achieves state-of-the-art performance in terms of distortion reduction and recognition accuracy.
In this paper, we present a practical implementation of the parametric multi-channel Wiener filter (PMWF) noise reduction algorithm. In particular, we build on methods that incorporate the multi-channel speech presence probability (MC-SPP) in the PMWF derivation and its output. The use of the MC-SPP brings several advantages. Firstly, the MC-SPP allows for better estimates of noise and speech statistics, for which we derive a direct update of the inverse of the noise power spectral density (PSD). Secondly, the MC-SPP is used to control the trade-off parameter of the PMWF which, with proper tuning, outperforms the traditional approach with a fixed trade-off parameter. Thirdly, the MC-SPP for each frequency band is used to obtain the MMSE estimate of the desired speech signal at the output, where we control the maximum amount of noise reduction based on our application. Experimental results on a large number of simulated scenarios show significant benefits of employing the MC-SPP in terms of SNR improvements and speech distortion.
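For context, a commonly used form of the PMWF with an explicit trade-off parameter mu is h = Phi_n^{-1} Phi_s u_ref / (mu + tr{Phi_n^{-1} Phi_s}), where mu = 0 recovers the MVDR beamformer and mu = 1 the multi-channel Wiener filter. The sketch below shows this standard filter plus a schematic SPP-controlled output stage; how the paper actually drives mu and the noise-reduction floor from the MC-SPP is not reproduced here, and the variable names are assumptions.

```python
# Sketch of a standard PMWF for one frequency bin, with the trade-off parameter
# mu exposed (mu = 0 gives the MVDR, mu = 1 the MWF).  The SPP-controlled output
# stage is only schematic; names and the floor g_min are assumptions.
import numpy as np

def pmwf_weights(Phi_s, Phi_n, mu, ref=0):
    """Phi_s, Phi_n: (M, M) speech / noise PSD matrices; ref: reference mic."""
    A = np.linalg.solve(Phi_n, Phi_s)            # Phi_n^{-1} Phi_s
    lam = np.trace(A).real
    return A[:, ref] / (mu + lam)

def spp_controlled_output(y, w, spp, g_min=0.1, ref=0):
    """Blend the filtered output with a floored reference channel using the
    multi-channel speech presence probability (schematic)."""
    filtered = w.conj() @ y
    return spp * filtered + (1.0 - spp) * g_min * y[ref]
```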
In this paper, we propose a multi-channel speech dereverberation method which can reduce reverberation even when the acoustic transfer functions (ATFs) are time-varying in noisy environments. The microphone input signal is modeled as a convolutive mixture in the time-frequency domain so as to incorporate late reverberation whose tap length is longer than the frame size of the short-term Fourier transform. To reduce reverberation effectively under time-varying ATF conditions, the proposed method extends the deterministic convolutive transfer function (D-CTF) to a probabilistic convolutive transfer function (P-CTF). A variational Bayesian framework is applied to approximate the joint posterior probability density function of the speech source signal and the ATFs. The variational posterior probability density functions and the other parameters are iteratively updated so as to maximize an evidence lower bound (ELBO). Experimental results with time-varying ATFs and background noise show that the proposed method can reduce reverberation more accurately than the weighted prediction error (WPE) and the Kalman-EM for dereverberation (KEMD) methods.
This article presents frame-by-frame online processing algorithms for a Weighted Power minimization Distortionless response convolutional beamformer (WPD). The WPD unifies widely used multichannel dereverberation and denoising methods, namely weighted prediction error based dereverberation (WPE) and the minimum power distortionless response (MPDR) beamformer, into a single convolutional beamformer, and achieves simultaneous dereverberation and denoising based on maximum likelihood estimation. We derive two different online algorithms, one based on frame-by-frame recursive updating of the spatio-temporal covariance matrix of the captured signal, and the other on recursive least squares estimation of the convolutional beamformer. In addition, for both algorithms, the desired signal’s relative transfer function (RTF) is estimated online using neural network based mask estimation. Experiments using the REVERB challenge dataset show the effectiveness of both algorithms in terms of objective speech enhancement measures and automatic speech recognition (ASR) performance.
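As a rough illustration of the first online variant, the core step is an exponentially weighted recursive update of the power-normalized spatio-temporal covariance matrix of the stacked observation vector; the forgetting factor, normalization floor, and function name below are assumptions, and the full WPD weight computation is omitted.

```python
# Schematic recursive covariance update for one frequency bin (not the paper's
# exact recursion): x_bar stacks the current frame and past taps across mics,
# and the outer product is normalized by the estimated desired-signal power.
import numpy as np

def update_covariance(R_prev, x_bar, desired_power, alpha=0.99):
    """alpha: forgetting factor; desired_power: estimated power of the desired signal."""
    inst = np.outer(x_bar, x_bar.conj()) / max(desired_power, 1e-8)
    return alpha * R_prev + (1.0 - alpha) * inst
```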
In this paper, we suggest a novel way to train a Generative Adversarial Network (GAN) for the purpose of non-parallel, many-to-many voice conversion. The goal of voice conversion (VC) is to transform speech from a source speaker into that of a target speaker without changing the phonetic content. Based on ideas from game theory, we propose multiplying the gradient of the Generator by suitable weights. The weights are calculated so that they increase the weight of fake samples that fool the Discriminator, resulting in a stronger Generator. Motivated by a recently presented GAN-based approach for VC, StarGAN-VC, we suggest a variation of StarGAN, referred to as Weighted StarGAN (WeStarGAN). The experiments are conducted on the standard CMU ARCTIC database. The WeStarGAN-VC approach achieves significantly better relative performance and is clearly preferred over the recently proposed StarGAN-VC method in terms of subjective speech quality and speaker similarity, with 75% and 65% preference scores, respectively.
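One plausible reading of the weighting idea, sketched in PyTorch below: per-sample generator losses are scaled by weights that grow with how strongly each fake sample already fools the Discriminator, and the weights are detached so they only rescale the gradient. The discriminator, the softmax weighting, and the loss form are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch (assumed formulation): weight each fake sample's generator loss by how
# convincingly it fools the discriminator, so convincing fakes drive the update.
import torch
import torch.nn.functional as F

def weighted_generator_loss(discriminator, fake_batch):
    logits = discriminator(fake_batch)                       # higher = "looks real"
    per_sample = F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits), reduction="none"
    ).view(logits.size(0), -1).mean(dim=1)                   # (batch,)
    # Detached weights: they rescale the gradient but are not differentiated through.
    weights = torch.softmax(logits.detach().view(logits.size(0), -1).mean(dim=1), dim=0)
    return (weights * per_sample).sum()
```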
Recently, voice conversion (VC) without parallel data has been successfully adapted to the multi-target scenario, in which a single model is trained to convert the input voice to many different speakers. However, such a model suffers from the limitation that it can only convert the voice to speakers seen in the training data, which narrows down the applicable scenarios of VC. In this paper, we propose a novel one-shot VC approach which is able to perform VC given only one example utterance from the source and target speakers, respectively, and the source and target speakers do not even need to be seen during training. This is achieved by disentangling speaker and content representations with instance normalization (IN). Objective and subjective evaluations show that our model is able to generate voice similar to the target speaker's. In addition to the performance measurement, we also demonstrate that this model is able to learn meaningful speaker representations without any supervision.
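A minimal sketch of how instance normalization can support this kind of disentanglement: IN without affine parameters removes per-utterance (speaker) statistics from the content path, while a separate speaker encoder summarizes them into an embedding that the decoder re-injects. Layer sizes and shapes below are illustrative assumptions.

```python
# Sketch of IN-based speaker/content disentanglement (shapes are assumptions).
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, 5, padding=2)
        # IN without affine parameters strips per-utterance (speaker) statistics.
        self.norm = nn.InstanceNorm1d(hidden, affine=False)

    def forward(self, x):                      # x: (batch, n_mels, frames)
        return self.norm(torch.relu(self.conv(x)))

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, 5, padding=2)

    def forward(self, x):                      # global average pool -> speaker embedding
        return torch.relu(self.conv(x)).mean(dim=-1)

# A decoder (not shown) would re-inject the speaker embedding, e.g. via adaptive IN.
```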
Building a voice conversion (VC) system for a new target speaker typically requires a large amount of speech data from that speaker. This paper investigates a method to build a VC system for an arbitrary target speaker from a single given utterance, without any adaptation training process. Inspired by global style tokens (GSTs), which have recently been shown to be effective in controlling the style of synthetic speech, we propose the use of global speaker embeddings (GSEs) to control the conversion target of the VC system. Speaker-independent phonetic posteriorgrams (PPGs) are employed as the local condition input to a conditional WaveNet synthesizer for waveform generation of the target speaker. Meanwhile, spectrograms extracted from the given utterance are fed into a reference encoder; the resulting reference embedding is then used as an attention query to the GSEs to produce the speaker embedding, which serves as the global condition input to the WaveNet synthesizer and controls the speaker identity of the generated waveform. In experiments, when compared with an adaptation-training-based any-to-any VC system, the proposed GSE-based VC approach performs equally well or better in both speech naturalness and speaker similarity, while offering clearly higher flexibility.
In this paper, we present a novel technique for non-parallel voice conversion (VC) with the use of cyclic variational autoencoder (CycleVAE)-based spectral modeling. In a variational autoencoder (VAE) framework, a latent space, usually with a Gaussian prior, is used to encode a set of input features. In VAE-based VC, the encoded latent features are fed into a decoder, along with speaker-coding features, to generate estimated spectra with either the original speaker identity (reconstructed) or another speaker identity (converted). Due to the non-parallel modeling condition, the converted spectra cannot be directly optimized, which heavily degrades the performance of VAE-based VC. In this work, to overcome this problem, we propose a CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling the converted features back into the system to obtain corresponding cyclic reconstructed spectra that can be directly optimized. The cyclic flow can be continued by using the cyclic reconstructed features as input for the next cycle. The experimental results demonstrate the effectiveness of the proposed CycleVAE-based VC, which yields higher accuracy of the converted spectra, generates latent features with a higher degree of correlation, and significantly improves the quality and conversion accuracy of the converted speech.
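The cyclic flow can be summarized schematically as follows; `encoder`, `decoder`, and `recon_loss` are placeholders for the VAE components and reconstruction criterion, not the authors' implementation.

```python
# Schematic of one cycle: convert to the target speaker, then recycle the
# converted features back with the source speaker code so that a directly
# optimizable cyclic reconstruction loss is obtained.
def cycle_vae_losses(encoder, decoder, x_src, code_src, code_trg, recon_loss):
    z, kl = encoder(x_src)                     # latent features (with KL term)
    x_recon = decoder(z, code_src)             # same-speaker reconstruction
    x_conv = decoder(z, code_trg)              # converted spectra (no direct target)
    z_cyc, kl_cyc = encoder(x_conv)            # recycle the converted features
    x_cyc = decoder(z_cyc, code_src)           # cyclic reconstruction
    return recon_loss(x_recon, x_src) + recon_loss(x_cyc, x_src) + kl + kl_cyc
```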
Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem only using a single generator. However, there is still a gap between real and converted speech. To bridge this gap, we rethink conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2. Particularly, we rethink conditional methods in two aspects: training objectives and network architectures. For the former, we propose a source-and-target conditional adversarial loss that allows all source domain data to be convertible to the target domain data. For the latter, we introduce a modulation-based conditional method that can transform the modulation of the acoustic feature in a domain-specific manner. We evaluated our methods on non-parallel multi-speaker VC. An objective evaluation demonstrates that our proposed methods improve speech quality in terms of both global and local structure measures. Furthermore, a subjective evaluation shows that StarGAN-VC2 outperforms StarGAN-VC in terms of naturalness and speaker similarity.
This paper presents an investigation of the robustness of statistical voice conversion (VC) in noisy environments. To develop various VC applications, such as augmented vocal production and augmented speech production, it is necessary to handle noisy input speech, because background sounds, such as external noise and accompanying sounds, usually exist in real environments. In this paper, we investigate the impact of background sounds on the conversion performance in singing voice conversion, focusing on two main VC frameworks: 1) vocoder-based VC and 2) vocoder-free VC based on direct waveform modification. We conduct a subjective evaluation of the converted singing voice quality under noisy conditions and reveal that vocoder-free VC is more robust against background sounds than vocoder-based VC. We also analyze the robustness of statistical VC and show that the kurtosis ratio of power spectral components before and after conversion is useful as an objective metric for evaluating it without any target reference signals.
This paper proposes a fast learning framework for non-parallel many-to-many voice conversion with residual Star Generative Adversarial Networks (StarGAN). Whereas the state-of-the-art StarGAN-VC approach learns an unreferenced mapping between a group of speakers’ acoustic features for non-parallel many-to-many voice conversion, our method, which we call Res-StarGAN-VC, enhances it by incorporating a residual mapping. The idea is to leverage the shared linguistic content between source and target features during conversion. The residual mapping is realized by using identity shortcut connections from the input to the output of the generator in Res-StarGAN-VC. Such shortcut connections accelerate the learning process of the network with no increase in parameters or computational complexity. They also help generate high-quality fake samples at the very beginning of adversarial training. Experiments and subjective evaluations show that the proposed method offers (1) significantly faster convergence in adversarial training and (2) clearer pronunciation and better speaker similarity of the converted speech, compared to the StarGAN-VC baseline on both mono-lingual and cross-lingual many-to-many voice conversion tasks.
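A minimal sketch of the identity shortcut idea: the generator predicts a residual that is added to the input features, so the shared linguistic content passes through unchanged at the start of training. Feature dimensions and layers below are illustrative assumptions.

```python
# Sketch of a residual generator with an identity shortcut (layer sizes assumed).
import torch
import torch.nn as nn

class ResidualGenerator(nn.Module):
    def __init__(self, n_feats=36, n_speakers=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_feats + n_speakers, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, n_feats, 5, padding=2),
        )

    def forward(self, x, target_code):         # x: (B, n_feats, T), code: (B, n_speakers)
        code = target_code.unsqueeze(-1).expand(-1, -1, x.size(-1))
        # Identity shortcut: the output is the input plus a learned residual.
        return x + self.net(torch.cat([x, code], dim=1))
```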
Recent advances in neural network-based text-to-speech have reached human-level naturalness in synthetic speech. Present sequence-to-sequence models can directly map text to mel-spectrogram acoustic features, which are convenient for modeling but present additional challenges for vocoding (i.e., waveform generation from the acoustic features). High-quality synthesis can be achieved with neural vocoders, such as WaveNet, but such autoregressive models suffer from slow sequential inference. Meanwhile, their existing parallel inference counterparts are difficult to train and require increasingly large model sizes. In this paper, we propose an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, and integrate a linear predictive synthesis filter into the model. Results show that the proposed model achieves significant improvement in inference speed, while outperforming a WaveNet in copy-synthesis quality.
This paper proposes an effective probability density distillation (PDD) algorithm for WaveNet-based parallel waveform generation (PWG) systems. Recently proposed teacher-student frameworks for PWG systems have successfully achieved real-time generation of speech signals. However, the difficulty of optimizing the PDD criterion without auxiliary losses results in quality degradation of the synthesized speech. To generate more natural speech signals within the teacher-student framework, we propose a novel optimization criterion based on generative adversarial networks (GANs). In the proposed method, the inverse autoregressive flow-based student model is incorporated as a generator in the GAN framework and jointly optimized by the PDD mechanism with the proposed adversarial learning method. As this process encourages the student to model the distribution of realistic speech waveforms, the perceptual quality of the synthesized speech becomes much more natural. Our experimental results verify that PWG systems with the proposed method outperform both those using conventional approaches and autoregressive generation systems with a well-trained teacher WaveNet.
We propose a voice conversion model that converts an arbitrary source speaker to an arbitrary target speaker using disentangled representations. Voice conversion is the task of converting a spoken utterance of a source speaker into the voice of a target speaker. Most prior work requires knowledge of either the source speaker, the target speaker, or both during training, with either a parallel or non-parallel corpus. Instead, we study the problem of voice conversion with non-parallel speech corpora in a one-shot learning setting. We convert arbitrary sentences of an arbitrary source speaker to target speakers given only one or a few training utterances from each target speaker. To achieve this, we propose to use disentangled representations of speaker identity and linguistic context. We use a recurrent neural network (RNN) encoder for speaker embedding and phonetic posteriorgrams as the linguistic context encoding, along with an RNN decoder to generate converted utterances. Ours is a simpler model without adversarial training or hierarchical model design, and thus more efficient. In the subjective tests, our approach achieved significantly better results than the baseline regarding similarity.
In this work, we investigate the effectiveness of two techniques for improving variational autoencoder (VAE) based voice conversion (VC). First, we reconsider the relationship among the vocoder features extracted with the high-quality vocoders adopted in conventional VC systems, and hypothesize that the spectral features are in fact F0 dependent. This hypothesis implies that, during the conversion phase, the latent codes and the converted features in VAE-based VC are in fact dependent on the source F0. To this end, we propose to utilize the F0 as an additional input to the decoder. The model can thus learn to disentangle the latent code from the F0 and generate converted features that depend on the converted F0. Second, to better capture the temporal dependencies of the spectral features and the F0 pattern, we replace the frame-wise conversion structure in the original VAE-based VC framework with a fully convolutional network structure. Our experiments demonstrate that both the degree of disentanglement and the naturalness of the converted speech are indeed improved.
The N10 system in the Voice Conversion Challenge 2018 (VCC 2018) has achieved high voice conversion (VC) performance in terms of speech naturalness and speaker similarity. We believe that further improvements can be gained from joint optimization (instead of separate optimization) of the conversion model and the WaveNet vocoder, as well as from leveraging information in the acoustic representation of the speech waveform, e.g. Mel-spectrograms. In this paper, we propose a VC architecture that jointly trains a conversion model mapping phonetic posteriorgrams (PPGs) to Mel-spectrograms and a WaveNet vocoder. The conversion model has a bottleneck layer, whose outputs are concatenated with the PPGs before being fed into the WaveNet vocoder as local conditioning. A weighted sum of a Mel-spectrogram prediction loss and a WaveNet loss is used as the objective function to jointly optimize the parameters of the conversion model and the WaveNet vocoder. Objective and subjective evaluation results show that the proposed approach achieves significantly improved voice conversion quality in terms of speech naturalness and speaker similarity for both cross-gender and intra-gender conversions.
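Schematically, the joint objective could look like the following sketch, where `conv_model` returns the predicted Mel-spectrogram and the bottleneck features, and `wavenet` is assumed to return its own training loss given the waveform and local conditioning; `alpha` and all names are assumptions.

```python
# Schematic weighted-sum objective for joint training of the conversion model
# and the WaveNet vocoder (module interfaces are assumed, not the authors').
import torch

def joint_vc_loss(conv_model, wavenet, ppg, mel_target, wav_target, alpha=1.0):
    mel_pred, bottleneck = conv_model(ppg)               # PPG -> Mel + bottleneck features
    mel_loss = torch.mean((mel_pred - mel_target) ** 2)
    local_cond = torch.cat([bottleneck, ppg], dim=1)     # bottleneck concatenated with PPGs
    wave_loss = wavenet(wav_target, local_cond)          # assumed to return the WaveNet loss
    return alpha * mel_loss + wave_loss                  # weighted sum, optimized end-to-end
```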
This paper focuses on using voice conversion (VC) to improve the speech intelligibility of surgical patients who have had parts of their articulators removed. Due to the difficulty of data collection, VC without parallel data is highly desired. Although techniques for unparallel VC — for example, CycleGAN — have been developed, they usually focus on transforming the speaker identity, and directly transforming the speech of one speaker to that of another speaker and as such do not address the task here. In this paper, we propose a new approach for unparallel VC. The proposed approach transforms impaired speech to normal speech while preserving the linguistic content and speaker characteristics. To our knowledge, this is the first end-to-end GAN-based unsupervised VC model applied to impaired speech. The experimental results show that the proposed approach outperforms CycleGAN.
This paper proposes a Group Latent Embedding for Vector Quantized Variational Autoencoders (VQ-VAE) used in nonparallel Voice Conversion (VC). Previous studies have shown that VQ-VAE can generate high-quality VC syntheses when it is paired with a powerful decoder. However, in a conventional VQ-VAE, adjacent atoms in the embedding dictionary can represent entirely different phonetic content. Therefore, the VC syntheses can have mispronunciations and distortions whenever the output of the encoder is quantized to an atom representing entirely different phonetic content. To address this issue, we propose an approach that divides the embedding dictionary into groups and uses the weighted average of atoms in the nearest group as the latent embedding. We conducted both objective and subjective experiments on the non-parallel CSTR VCTK corpus. Results show that the proposed approach significantly improves the acoustic quality of the VC syntheses compared to the traditional VQ-VAE (13.7% relative improvement) while retaining the voice identity of the target speaker.
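One plausible realization of the grouped quantization, sketched in PyTorch: the dictionary is split into equal-sized groups, the nearest group is selected per encoder output, and a softmax-weighted average of that group's atoms serves as the latent embedding. The group-selection rule and weighting below are assumptions.

```python
# Sketch of a group latent embedding step (assumed selection/weighting rules).
import torch

def group_latent_embedding(z_e, codebook, n_groups):
    """z_e: (N, D) encoder outputs; codebook: (n_groups * atoms_per_group, D)."""
    n_atoms, dim = codebook.shape
    atoms = n_atoms // n_groups
    d = torch.cdist(z_e, codebook).view(z_e.size(0), n_groups, atoms)  # (N, G, A)
    nearest_group = d.mean(dim=-1).argmin(dim=-1)                      # pick a group per frame
    groups = codebook.view(n_groups, atoms, dim)                       # (G, A, D)
    d_sel = d[torch.arange(z_e.size(0)), nearest_group]                # (N, A)
    w = torch.softmax(-d_sel, dim=-1)                                  # closer atoms weigh more
    return torch.einsum('na,nad->nd', w, groups[nearest_group])        # weighted average (N, D)
```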
In this work, we introduce a semi-supervised approach to the voice conversion problem, in which speech from a source speaker is converted into speech of a target speaker. The proposed method makes use of both parallel and non-parallel utterances from the source and target simultaneously during training. This approach can be used to extend existing parallel-data voice conversion systems such that they can be trained with semi-supervision. We show that incorporating semi-supervision improves voice conversion performance compared to fully supervised training when the number of parallel utterances is limited, as in many practical applications. Additionally, we find that increasing the number of non-parallel utterances used in training continues to improve performance when the amount of parallel training data is held constant.
This paper proposes a determined blind source separation (BSS) method with a Bayesian generalization for unified modelling of multiple audio sources. Our probabilistic framework allows a flexible multi-source modelling where the number of latent features required for the unified model is optimally estimated. When partitioning the latent features of the unified model to represent individual sources, the proposed approach helps to avoid over-fitting or under-fitting the correlations among sources. This adaptability of our Bayesian generalization therefore adds flexibility to conventional BSS approaches, where the number of latent features in the unified model has to be specified in advance. In the task of separating speech mixture signals, we show that our proposed method models diverse sources in a flexible manner and markedly improves the separation performance as compared to the conventional methods.
In this paper, we propose a method of single-channel speaker-independent multi-speaker speech separation for an unknown number of speakers. As opposed to previous works, in which the number of speakers is assumed to be known in advance and separation models are specific to that number, our proposed method can be applied to cases with different numbers of speakers using a single model by recursively separating one speaker at a time. To make the separation model recursively applicable, we propose one-and-rest permutation invariant training (OR-PIT). Evaluations on the WSJ0-2mix and WSJ0-3mix datasets show that our proposed method achieves state-of-the-art results for two- and three-speaker mixtures with a single model. Moreover, the same model can separate four-speaker mixtures, which were never seen during training. We further propose detecting the number of speakers in a mixture during recursive separation, and show that this approach estimates the number of speakers more accurately than detection in advance with a deep neural network based classifier.
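The recursive separation loop can be sketched as follows, where `separator` performs the one-and-rest split and `is_single_speaker` stands in for the proposed detector of whether the residual contains only one speaker; both are placeholders.

```python
# Schematic recursive separation: peel off one speaker at a time until the
# residual is judged to contain a single speaker (stopping rule assumed).
def recursive_separate(mixture, separator, is_single_speaker, max_speakers=10):
    sources, residual = [], mixture
    for _ in range(max_speakers):
        one, rest = separator(residual)        # one-and-rest split
        sources.append(one)
        if is_single_speaker(rest):            # residual holds only one speaker
            sources.append(rest)
            break
        residual = rest
    return sources
```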
This paper examines the applicability of two deep learning based solutions to the overlapping speaker separation problem in realistic scenarios. Firstly, we present experiments showing that these methods are applicable to a broad range of languages. Further experimentation indicates limited performance loss for untrained languages when these share common features with the trained language(s). Secondly, we investigate how the methods deal with realistic background noise and propose some modifications to better cope with these disturbances. The deep learning methods examined are deep clustering and deep attractor networks.
Independent vector analysis (IVA) utilizing Gaussian mixture models (GMMs) as source priors has been demonstrated to be an effective algorithm for joint blind source separation (JBSS). However, an extra pre-training process is required to provide initial parameter values for successful speech separation. In this paper, we introduce a time-varying parameter into the GMM to adapt to the temporal power fluctuation embedded in nonstationary speech signals, so as to avoid the pre-training process. The expectation-maximization (EM) procedure updating both the demixing matrix and the signal model is altered correspondingly. Experimental results confirm the efficacy of the proposed method under random initialization and further show its advantages in terms of competitive separation accuracy and faster convergence.