We present a new method for the estimation of a continuous fundamental frequency (F0) contour. The algorithm implements a global optimization and yields virtually error-free F0 contours for high-quality speech signals. Such F0 contours are subsequently used to extract a continuous fundamental wave. Some local properties of this wave, together with a number of other speech features, allow the frames of a speech signal to be classified into five classes: voiced, unvoiced, mixed, irregularly glottalized, and silence. The presented F0 detection and frame classification can be applied to F0 modeling and prosodic modification of speech segments in high-quality concatenative speech synthesis.
In this paper we present a method based on a time-varying sinusoidal model for robust and accurate estimation of amplitude and frequency modulations (AM-FM) in speech. The suggested approach has two main steps. First, speech is modeled with a sinusoidal model having time-varying amplitudes. Specifically, the model uses a first-order time polynomial with complex coefficients to capture the instantaneous amplitude and frequency (phase) components. Next, the model parameters are updated using the previously estimated instantaneous phase information. Thus, an iterative scheme for AM-FM decomposition of speech is suggested, which was validated on synthetic AM-FM signals and tested on the reconstruction of voiced speech signals, where the signal-to-error reconstruction ratio (SERR) was used as the measure. Compared to the standard sinusoidal representation, the suggested approach was found to improve the corresponding SERR by 47%, resulting in over 30 dB of SERR.
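As a rough illustration of the first step above, the sketch below fits a sinusoidal model whose per-component amplitude is a first-order polynomial in time, by ordinary least squares, given initial component frequencies (e.g. harmonics of a coarse F0 estimate); the iterative phase-refinement stage is not shown, and all names and the centred time axis are illustrative choices rather than details from the paper.

```python
import numpy as np

def fit_linear_amplitude_sinusoids(frame, freqs_hz, fs):
    """Least-squares fit of a sum of sinusoids whose amplitudes vary
    linearly in time (constant + slope term per component)."""
    n = np.arange(len(frame)) - len(frame) / 2.0      # centred time axis
    cols = []
    for f in freqs_hz:
        w = 2.0 * np.pi * f / fs
        cols += [np.cos(w * n), n * np.cos(w * n),    # constant and ramped cosine
                 np.sin(w * n), n * np.sin(w * n)]    # constant and ramped sine
    V = np.stack(cols, axis=1)
    coef, *_ = np.linalg.lstsq(V, frame, rcond=None)
    recon = V @ coef
    serr_db = 10 * np.log10(np.sum(frame ** 2) / np.sum((frame - recon) ** 2))
    return coef, recon, serr_db    # coefficients, reconstruction, SERR in dB
```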
The paper presents voice source waveform modeling techniques based on principal component analysis (PCA) and Gaussian mixture modeling (GMM). The voice source is obtained by inverse-filtering speech with the estimated vocal tract filter. This decomposition is useful in speech analysis, synthesis, recognition and coding. Existing models of the voice source signal are based on function fitting or physically motivated assumptions, and although they are well defined, estimation of their parameters is not well understood and few are capable of reproducing the large variety of voice source waveforms. Here, a data-driven approach is presented for signal decomposition and classification based on the principal components of the voice source. The principal components are analyzed and the "prototype" voice source signals corresponding to the Gaussian mixture means are examined. We show how an unknown signal can be decomposed into its components and/or prototypes and resynthesized. We show how the techniques are suited for both low-bitrate and high-quality analysis/synthesis schemes.
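A minimal sketch of such a data-driven pipeline, assuming a matrix X of pitch-synchronous, length-normalised voice-source cycles (one cycle per row) has already been obtained by inverse filtering; scikit-learn's PCA and GaussianMixture stand in for the PCA/GMM stages, and the component counts and function names are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def build_source_codebook(X, n_components=8, n_prototypes=16):
    """Learn PCA weights of voice-source cycles and a GMM over those weights;
    the mixture means, mapped back to the time domain, act as prototypes."""
    pca = PCA(n_components=n_components).fit(X)
    Z = pca.transform(X)                               # low-dimensional PC weights
    gmm = GaussianMixture(n_components=n_prototypes, covariance_type='full').fit(Z)
    prototypes = pca.inverse_transform(gmm.means_)     # prototype source cycles
    return pca, gmm, prototypes

def resynthesize_cycle(cycle, pca):
    # decompose an unknown cycle into its principal components and rebuild it
    return pca.inverse_transform(pca.transform(cycle[None, :]))[0]
```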
In this paper we propose a model-based approach to instantaneous pitch estimation in noisy speech, by way of incorporating pitch smoothness assumptions into the well-known harmonic model. In this approach, the latent pitch contour is modeled using a basis of smooth polynomials, and is fit to waveform data by way of a harmonic model whose partials have time-varying amplitudes. The resultant nonlinear least squares estimation task is accomplished through the Gauss-Newton method with a novel initialization step that serves to greatly increase algorithm efficiency. We demonstrate the accuracy and robustness of our method through comparisons to state-of-the-art pitch estimation algorithms using both simulated and real waveform data.
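The sketch below conveys the flavour of such a fit under simplifying assumptions: the pitch contour is a low-order polynomial in time, harmonic amplitudes are held constant over the frame (unlike the time-varying amplitudes in the paper) and eliminated in closed form inside the residual, and SciPy's trust-region least_squares plays the role of the Gauss-Newton step; the initial guess and all names are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def harmonic_residual(poly_coeffs, frame, fs, n_harm=10):
    """Residual of a harmonic model whose pitch track f0(t) is a smooth
    polynomial; amplitudes are linear in the model and solved in closed form."""
    t = np.arange(len(frame)) / fs
    f0 = np.polyval(poly_coeffs, t)              # instantaneous pitch in Hz
    phase = 2.0 * np.pi * np.cumsum(f0) / fs     # fundamental phase track
    V = np.concatenate(
        [np.stack([np.cos(h * phase), np.sin(h * phase)], axis=1)
         for h in range(1, n_harm + 1)], axis=1)
    amps, *_ = np.linalg.lstsq(V, frame, rcond=None)
    return frame - V @ amps

# refine a rough initial contour (here a constant 120 Hz) with a
# Gauss-Newton-style nonlinear least-squares solver:
# fit = least_squares(harmonic_residual, x0=[0.0, 0.0, 120.0], args=(frame, fs))
```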
Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of the complex cepstrum for source-tract deconvolution has been discussed in various articles. However, no study has proposed a glottal flow estimation methodology based on the cepstrum and reported effective results. In this paper, we show that the complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of a windowed speech signal, as done by the Zeros of the Z-Transform (ZZT) decomposition. Based on exactly the same principles presented for the ZZT decomposition, windowing should be applied such that the windowed speech signals exhibit mixed-phase characteristics conforming to the speech production model, in which the anticausal component is mainly due to the glottal flow open phase. The advantage of the complex cepstrum-based approach compared to the ZZT decomposition is its much higher speed.
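A compact sketch of the cepstral separation step, assuming the frame has already been windowed appropriately (e.g. GCI-centred, roughly two pitch periods) so that it exhibits the required mixed-phase behaviour; phase unwrapping is done naively and the zero-quefrency term is split evenly between the two parts, both of which are simplifications rather than the paper's exact recipe.

```python
import numpy as np

def complex_cepstrum_decomposition(frame, nfft=4096):
    """Split a windowed, mixed-phase speech frame into anticausal and causal
    parts via the complex cepstrum; the anticausal part is expected to be
    dominated by the glottal open phase, the causal part by the rest."""
    X = np.fft.fft(frame, nfft)
    log_X = np.log(np.abs(X) + 1e-12) + 1j * np.unwrap(np.angle(X))
    ceps = np.fft.ifft(log_X)                     # complex cepstrum

    causal = np.zeros_like(ceps)
    anticausal = np.zeros_like(ceps)
    causal[1:nfft // 2] = ceps[1:nfft // 2]       # positive quefrencies
    anticausal[nfft // 2:] = ceps[nfft // 2:]     # negative quefrencies
    causal[0] = anticausal[0] = ceps[0] / 2       # split the zero-quefrency term

    glottal = np.real(np.fft.ifft(np.exp(np.fft.fft(anticausal))))
    tract = np.real(np.fft.ifft(np.exp(np.fft.fft(causal))))
    return glottal, tract
```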
Popular parametric models of speech sounds such as the source-filter model provide a fixed means of describing the variability inherent in speech waveform data. However, nonlinear dimensionality reduction techniques such as the intrinsic Fourier analysis method of Jansen and Niyogi provide a more flexible means of adaptively estimating such structure directly from data. Here we employ this approach to learn a low-dimensional manifold whose geometry is meant to reflect the structure implied by the human speech production system. We derive a novel algorithm to efficiently learn this manifold in the case of many training examples, the setting of both greatest practical interest and greatest computational difficulty. We then demonstrate the utility of our method by way of a proof-of-concept phoneme identification system that operates effectively in the intrinsic Fourier domain.
Recently, the modulation spectrum has been proposed and found to be a useful source of speech information. The modulation spectrum represents longer term variations in the spectrum and thus implicitly requires features extracted from much longer speech segments compared to MFCCs and their delta terms. In this paper, a Discrete Cosine Transform (DCT) analysis of the log magnitude spectrum combined with a Discrete Cosine Series (DCS) expansion of DCT coefficients over time is proposed as a method for capturing both the spectral and modulation information. These DCT/DCS features can be computed so as to emphasize frequency resolution or time resolution or a combination of the two factors. Several variations of the DCT/DCS features were evaluated with phonetic recognition experiments using TIMIT and its telephone version (NTIMIT). Best results obtained with a combined feature set are 73.85% for TIMIT and 62.5% for NTIMIT. The modulation features are shown to be far more important than the spectral features for automatic speech recognition and far more noise robust.
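A rough sketch of the feature computation, assuming a precomputed log-magnitude spectrogram for a longer segment (frames by frequency bins); the DCS expansion over time is approximated here by a second DCT along the frame axis, and the truncation orders are placeholders rather than the settings evaluated in the paper.

```python
import numpy as np
from scipy.fft import dct

def dct_dcs_features(log_mag_spectrogram, n_dct=13, n_dcs=6):
    """log_mag_spectrogram: array of shape (n_frames, n_freq_bins).
    Step 1: DCT across frequency in each frame (spectral shape).
    Step 2: cosine expansion across time of each retained coefficient's
            trajectory (modulation / longer-term variation)."""
    spectral = dct(log_mag_spectrogram, type=2, norm='ortho', axis=1)[:, :n_dct]
    modulation = dct(spectral, type=2, norm='ortho', axis=0)[:n_dcs, :]
    return modulation.flatten()      # n_dcs * n_dct features per segment
```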
Phase information resulting from the harmonic analysis of speech can be used very successfully to determine the polarity of a voiced speech segment. In this paper we present two algorithms which calculate the signal polarity from this information. One is based on the effect of the glottal signal on the phase of the first harmonics, and the other on the relative phase shifts between the harmonics. The detection rates of these two algorithms are compared against other established algorithms.
In this paper, we present a simple and efficient feature modeling approach for tracking the pitch of two speakers speaking simultaneously. We model the spectrogram features using Gaussian Mixture Models (GMMs) in combination with the Minimum Description Length (MDL) model selection criterion. This enables the number of Gaussian components to be determined automatically depending on the available data for a specific pitch pair. A factorial hidden Markov model (FHMM) is applied for tracking. We compare our approach to two methods based on correlogram features [1]. Those methods use either an HMM [1] or an FHMM [7] for tracking. Experimental results on the Mocha-TIMIT database [2] show that our proposed approach significantly outperforms the correlogram-based methods for speech utterances mixed at 0 dB. The superior performance even holds when white Gaussian noise is added to the mixed speech utterances during pitch tracking.
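A minimal sketch of the model-selection idea for a single pitch pair, using scikit-learn's GaussianMixture and its BIC score as a stand-in for the MDL criterion; the FHMM tracking stage is not shown and the maximum component count is a placeholder.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_mdl(features, max_components=16):
    """Fit GMMs with 1..max_components Gaussians to the spectrogram features
    of one pitch pair and keep the model with the lowest description-length
    style score, so the complexity adapts to the amount of available data."""
    best_gmm, best_score = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, covariance_type='diag').fit(features)
        score = gmm.bic(features)        # BIC used here as an MDL-like criterion
        if score < best_score:
            best_gmm, best_score = gmm, score
    return best_gmm
```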
In this paper, we investigate a new method for extracting useful information from the group delay spectrum of speech. The group delay spectrum is often poorly behaved and noisy. In the literature, various methods have been proposed to address this problem. However, to make the group delay a more tractable function, these methods have typically relied upon some modification of the underlying speech signal. The method proposed in this paper does not require such modifications. To accomplish this, we investigate a new function derived from the group delay spectrum, namely the group delay deviation. We use it for both narrowband analysis and wideband analysis of speech and show that this function exhibits meaningful formant and pitch information.
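For reference, the sketch below computes the standard group delay of a frame via the identity tau(w) = (X_R*Y_R + X_I*Y_I)/|X|^2, where Y is the DFT of n*x[n], and then one plausible reading of a "deviation": the departure of the raw group delay from a locally smoothed trend. The second function is an assumption for illustration, not necessarily the paper's exact definition.

```python
import numpy as np

def group_delay(frame, nfft=1024):
    """Group delay from the DFTs of x[n] and n*x[n] (standard identity)."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, nfft)
    Y = np.fft.rfft(n * frame, nfft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)

def group_delay_deviation(frame, nfft=1024, smooth_bins=31):
    """Raw group delay minus a moving-average trend: spikes near spectral
    zeros stand out less while formant/pitch-related structure is kept."""
    tau = group_delay(frame, nfft)
    trend = np.convolve(tau, np.ones(smooth_bins) / smooth_bins, mode='same')
    return tau - trend
```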
Throat microphones (TM), which are robust to background noise, can be used in environments with high levels of background noise. Speech collected using a TM is perceptually less natural. The objective of this paper is to map the spectral features (represented in the form of cepstral features) of TM and close speaking microphone (CSM) speech to improve the former's perceptual quality, and to represent it in an efficient manner for coding. The spectral mapping of TM and CSM speech is done using a multilayer feed-forward neural network, which is trained on features derived from TM and CSM speech. The sequence of estimated CSM spectral features is quantized and coded as a sequence of codebook indices using vector quantization. The sequence of codebook indices, the pitch contour and the energy contour derived from the TM signal are used to store/transmit the TM speech information efficiently. At the receiver, the all-pole system corresponding to the estimated CSM spectral vectors is excited by a synthetic residual to generate the speech signal.
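A toy version of the mapping and quantisation stages, assuming time-aligned TM and CSM cepstral feature matrices are available; scikit-learn's MLPRegressor stands in for the multilayer feed-forward network and KMeans for the vector-quantisation codebook, and all hyperparameters are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.cluster import KMeans

def train_mapping_and_codebook(tm_ceps, csm_ceps, codebook_size=256):
    """tm_ceps, csm_ceps: aligned (frames x dims) cepstral features from the
    throat and close-speaking microphones for the same utterances."""
    mapper = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
    mapper.fit(tm_ceps, csm_ceps)               # TM -> estimated CSM features
    codebook = KMeans(n_clusters=codebook_size).fit(mapper.predict(tm_ceps))
    return mapper, codebook

def encode_frames(tm_ceps, mapper, codebook):
    # transmit only the codebook indices of the mapped (estimated CSM) features
    return codebook.predict(mapper.predict(tm_ceps))
```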
This paper examines the Lombard effect on the excitation features in speech production. These features correspond mostly to the acoustic features at the subsegmental (< pitch period) level. The instantaneous fundamental frequency F0 (i.e., pitch), the strength of excitation at the instants of significant excitation, and a loudness measure reflecting the sharpness of the impulse-like excitation around epochs are used to represent the excitation features at the subsegmental level. The Lombard effect influences both the pitch and the loudness. The extent of the Lombard effect on speech depends on the nature and level (or intensity) of the external feedback that causes it.
In this study a number of linear and nonlinear dimensionality reduction methods are applied to high dimensional representations of synthetic speech to produce corresponding low dimensional embeddings. Several important characteristics of the synthetic speech, such as formant frequencies and f0, are known and controllable prior to dimensionality reduction. The degree to which these characteristics are retained after dimensionality reduction is examined in visualisation and classification experiments. Results of these experiments indicate that each method is capable of discovering meaningful low dimensional representations of synthetic speech and that the nonlinear methods may outperform linear methods in some cases.
Recent research has proposed a non-parametric speech waveform representation (rep) based on zeros of the z-transform (ZZT) [1] [2]. Empirically, the ZZT rep has successfully been applied to discriminating the glottal and vocal tract components in pitch-synchronously windowed speech by using the unit circle (UC) as the discriminant [1,2]. Further, similarity between ZZT reps of windowed speech, glottal flow waveforms, and waveforms of glottal flow opening and closing phases has been demonstrated [1,3]. Therefore, the underlying cause of the separation on either side of the UC can be analyzed via the individual ZZT reps of the opening and closing phase waveforms; the waveforms are generated by the LF glottal flow model (GFM) [1]. The present paper demonstrates this cause and effect analytically and thereby supplements the previous empirical works. Moreover, this paper demonstrates that immiscibility varies with frame length; lengths that maximize or minimize immiscibility are presented.
In this paper, we investigate a novel method for transforming line spectral frequency (LSF) parameters into lower-dimensional coefficients. A radial basis function neural network (RBF NN) based transformation model is used to fit the LSF vectors. In the training process, two criteria, mean squared error and weighted mean squared error, are used to measure the distance between the original and approximated vectors. In addition, properties of the LSF parameters are taken into account to supervise the training process. As a result, LSF vectors are represented by the coefficient vectors of the transformation model. The experimental results reveal that a 24-order LSF vector can be transformed into a 15-dimensional coefficient vector with an average spectral distortion of approximately 1 dB. Subjective evaluation shows that the proposed transformation method does not lead to a significant decrease in voice quality.
In this paper, we present a study to understand the relation among spectra of speakers enunciating the same sound and investigate the issue of uniform versus non-uniform scaling. There is a lot of interest in understanding this relation as speaker variability is a major source of concern in many applications including Automatic Speech Recognition (ASR). Using dynamic programming, we find mapping relations between smoothed spectral envelopes of speakers enunciating the same sound and show that these relations are not linear but have a consistent non-uniform behavior. This non-uniform behavior is also shown to vary across vowels. Through a series of experiments, we show that using the observed non-uniform relation provides better vowel normalization than just a simple linear scaling relation. All results in this paper are based on vowel data from TIMIT, Hillenbrand et al. and North Texas databases.
Frequency modulation (FM) features are typically extracted using a filterbank, usually based on an auditory frequency scale; however, there is psychophysical evidence to suggest that this scale may not be optimal for extracting speaker-specific information. In this paper, speaker-specific information in FM features is analyzed as a function of the filterbank structure at the feature, model and classification stages. Scatter-matrix-based separation measures at the feature level and Kullback-Leibler distance based measures at the model level are used to analyze the discriminative contributions of the different bands. Then a series of speaker recognition experiments is performed to study how each band of the FM feature contributes to speaker recognition. A new filterbank structure is proposed that attempts to maximize the speaker-specific information in the FM feature for telephone data. Finally, the distribution of speaker-specific information is analyzed for wideband speech.
In this contribution, a method for the nasalization of speech sounds is proposed based on model-based spectral relations between mouth and nose signals. For that purpose, the mouth and nose signals of speech utterances are recorded simultaneously. The spectral relations between the mouth and nose signals are modeled by pole-zero models. Filtering non-nasalized speech signals with these pole-zero models yields approximations of the nasal signals, which can be utilized to nasalize the speech signals. The artificial nasalization can be exploited to modify speech units from a non-nasalized or weakly nasalized representation which should be nasalized due to coarticulation or for the production of foreign words.
An auditory nerve model allows faster investigation of new signal processing algorithms for hearing aids. This paper presents a study of the degradation of auditory nerve (AN) responses at a phonetic level for a range of sensorineural hearing losses and flat audiograms. The AN model of Zilany & Bruce was used to compute responses to a diverse set of phoneme rich sentences from the TIMIT database. The characteristics of both the average discharge rate and spike timing of the responses are discussed. The experiments demonstrate that a mean absolute error metric provides a useful measure of average discharge rates but a more complex measure is required to capture spike timing response errors.
This paper proposes a method to automatically measure the timing characteristics of a second-language learner's speech as a means to evaluate language proficiency in speech production. We used the durational differences from native speakers' speech as an objective measure to evaluate the learners' timing characteristics. To provide flexible evaluation without the need to collect any additional English reference speech, we employed segmental durations predicted by a statistical duration model instead of raw durations measured from native speech. The proposed evaluation method was tested using English speech data uttered by Thai-native learners with different amounts of English-study experience. An evaluation experiment shows that the proposed measure based on duration differences correlates closely with the subjects' English-study experience. Moreover, segmental duration differences revealed Thai learners' speech-control characteristics in word-final stress assignment. These results support the effectiveness of the proposed model-based objective evaluation.
A Bayesian approach to non-intrusive quality assessment of narrow-band speech is presented. The speech features used to assess quality are the sample mean and variance of band-powers evaluated from the temporal envelope in the channels of an auditory filter-bank. Bayesian multivariate adaptive regression splines (BMARS) is used to map features into quality ratings. The proposed combination of features and regression method leads to a high performance quality assessment algorithm that learns efficiently from a small amount of training data and avoids overfitting. Use of the Bayesian approach also allows the derivation of credible intervals on the model predictions, which provide a quantitative measure of model confidence and can be used to identify the need for complementing the training databases.
Some phoneme boundaries correspond to abrupt changes in the acoustic signal. Others are less clear-cut because the transition from one phoneme to the next is gradual.
This paper introduces a novel method of speech epoch extraction using a modified Wigner-Ville distribution. The Wigner-Ville distribution is an efficient speech representation tool with which minute speech variations can be tracked precisely. In this paper, epoch detection/extraction using the accurate energy tracking, noise robustness, and efficient speech representation properties of a modified discrete Wigner-Ville distribution is explored. The developed technique is tested on the Arctic database, using its epoch information from an electroglottograph as the reference epochs. The developed algorithm is compared with the available state-of-the-art methods under various noise conditions (babble, white, and vehicle) and different levels of degradation. The proposed method outperforms the existing methods in the literature.
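For orientation, a basic pseudo (windowed) Wigner-Ville distribution of a speech signal can be computed as below; the analytic signal is used to reduce cross-terms, and the specific modification adopted in the paper is not reproduced here, so this is only a generic sketch with illustrative parameter choices.

```python
import numpy as np
from scipy.signal import hilbert

def pseudo_wvd(x, window_len=127):
    """Columns are FFTs of the windowed instantaneous autocorrelation
    z[n+tau] * conj(z[n-tau]) of the analytic signal z."""
    z = hilbert(np.asarray(x, dtype=float))
    N = len(z)
    half = window_len // 2
    win = np.hanning(window_len)
    taus = np.arange(-half, half + 1)
    W = np.zeros((window_len, N))
    for n in range(N):
        valid = (n + taus >= 0) & (n + taus < N) & (n - taus >= 0) & (n - taus < N)
        r = np.zeros(window_len, dtype=complex)
        r[valid] = z[n + taus[valid]] * np.conj(z[n - taus[valid]]) * win[valid]
        W[:, n] = np.real(np.fft.fft(np.fft.ifftshift(r)))   # real by symmetry
    return W
```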
Due to the increasing use of fusion in speaker recognition systems, features that are complementary to MFCCs offer opportunities to advance the state of the art. One promising feature is based on group delay; however, it can suffer from large variability due to its numerical formulation. In this paper, we investigate reducing this variability in group delay features with least squares regularization. Evaluations on the NIST 2001 and 2008 SRE databases show relative EER improvements of at least 6% and 18%, respectively, when the group delay-based system is fused with the MFCC-based system.
This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First, a mean-based signal is computed, and intervals where speech events are expected to occur are extracted from it. Second, within each interval a precise position of the speech event is assigned by locating a discontinuity in the Linear Prediction residual. The proposed method is compared to the DYPSA algorithm on the CMU ARCTIC database. A significant improvement as well as better noise robustness is reported. In addition, results of GOI identification accuracy are promising for glottal source characterization.
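A skeletal version of the two ingredients is sketched below, assuming a known rough mean pitch for sizing the window; the interval extraction and peak picking that turn these two signals into GCI/GOI locations are omitted, and the window choice (Blackman, about 1.75 mean pitch periods), LPC order and names are placeholders rather than the paper's settings.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.signal.windows import blackman

def mean_based_signal(x, fs, f0_mean=120.0):
    """Sliding weighted mean of the speech signal; its oscillation marks the
    intervals in which glottal events are expected to occur."""
    win_len = int(1.75 * fs / f0_mean) | 1        # force an odd window length
    w = blackman(win_len)
    return np.convolve(x, w / w.sum(), mode='same')

def lp_residual(x, order=16):
    """Linear-prediction residual; its strongest discontinuity inside each
    expected interval is then taken as the precise event location."""
    # least-squares autoregressive fit: x[n] ~ sum_k a_k * x[n-k]
    A = np.stack([x[order - k - 1: len(x) - k - 1] for k in range(order)], axis=1)
    a, *_ = np.linalg.lstsq(A, x[order:], rcond=None)
    prediction = lfilter(np.concatenate(([0.0], a)), [1.0], x)
    return x - prediction
```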