This paper presents a novel pitch tracking method in the time domain. Based on the difference function used in YIN, referred to hereinafter as the sum magnitude difference square function (SMDSF), we propose two modified types of SMDSFs, together with several methods to calculate these SMDSFs efficiently and without bias using the FFT algorithm. In pitch estimation, every type of SMDSF has its own estimation error characteristics. By analyzing these characteristics, we define a new function that combines the two aforementioned types of SMDSFs to prevent estimation errors. A new, relatively accurate, and real-time pitch tracking algorithm is then proposed which does not need any extra preprocessing or post-processing. Experimental results show that the proposed algorithm achieves remarkably good performance for pitch tracking.
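A minimal sketch of the kind of FFT-based computation the abstract alludes to, assuming a standard YIN-style squared-difference function; the paper's unbiased SMDSF variants are not reproduced here, and the function names and pitch range are illustrative.

```python
import numpy as np

def difference_function_fft(x, max_lag):
    """Squared-difference function d(tau) = sum_j (x[j] - x[j + tau])^2,
    computed via FFT-based autocorrelation plus cumulative energies
    instead of the O(N^2) double loop. Assumes max_lag < len(x)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    nfft = 1 << (2 * n - 1).bit_length()          # zero-pad so correlation is linear
    X = np.fft.rfft(x, nfft)
    acf = np.fft.irfft(X * np.conj(X), nfft)[:max_lag + 1]
    energy = np.concatenate(([0.0], np.cumsum(x ** 2)))
    taus = np.arange(max_lag + 1)
    e_head = energy[n - taus]                     # sum of x[j]^2,     j = 0 .. n-tau-1
    e_tail = energy[n] - energy[taus]             # sum of x[j+tau]^2, j = 0 .. n-tau-1
    return e_head + e_tail - 2.0 * acf

def pitch_from_difference(d, fs, fmin=60.0, fmax=400.0):
    """Pick the lag with the smallest difference value inside the allowed pitch range."""
    lo, hi = int(fs / fmax), int(fs / fmin)
    tau = lo + int(np.argmin(d[lo:hi + 1]))
    return fs / tau
```

Calling `pitch_from_difference(difference_function_fft(frame, int(fs / 60)), fs)` on a voiced frame gives a raw F0 estimate; YIN-style cumulative-mean normalization and the paper's combination of SMDSF types would sit on top of this.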
This paper proposes a pitch estimation algorithm that is based on optimal harmonic model fitting. The algorithm operates directly on the time-domain signal and has a relatively simple mathematical background. To increase its efficiency and accuracy, the algorithm is applied in combination with an autocorrelation-based initialization phase. For testing purposes we compare its performance on pitch-annotated corpora with several conventional time-domain pitch estimation algorithms, and also with a recently proposed one. The results show that even the autocorrelation-based first phase significantly outperforms the traditional methods, and also slightly outperforms the recently proposed YIN algorithm. After applying the second phase, the harmonic approximation step, the number of errors can be reduced by a further 20% relative to the error obtained in the first phase.
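One plausible reading of the two-phase scheme, sketched below under assumptions not stated in the abstract: a coarse F0 estimate (e.g. from autocorrelation) is refined by scanning nearby fundamentals and keeping the one whose least-squares harmonic (sinusoidal) fit leaves the smallest residual. The function names, number of harmonics, and search span are illustrative, not the paper's exact procedure.

```python
import numpy as np

def harmonic_fit_error(x, fs, f0, n_harm=10):
    """Residual energy after least-squares fitting of a harmonic model
    (DC + sinusoids at k * f0) to the frame x; lower is a better f0 candidate."""
    t = np.arange(len(x)) / fs
    cols = [np.ones_like(t)]
    for k in range(1, n_harm + 1):
        cols.append(np.cos(2 * np.pi * k * f0 * t))
        cols.append(np.sin(2 * np.pi * k * f0 * t))
    A = np.stack(cols, axis=1)
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    return float(np.sum((x - A @ coef) ** 2))

def refine_f0(x, fs, f0_init, span=0.05, steps=21):
    """Refine an autocorrelation-based initial estimate by scanning a small
    neighbourhood (here +/- 5%) and keeping the best-fitting fundamental."""
    candidates = f0_init * (1.0 + np.linspace(-span, span, steps))
    errors = [harmonic_fit_error(x, fs, f) for f in candidates]
    return float(candidates[int(np.argmin(errors))])
```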
Standard correlation-based methods are not effective in estimating the pitch tracks of multiple speech sources from a single-microphone input. In this paper, an adaptive harmonic filtering method is proposed to jointly estimate the source signals and their corresponding fundamental frequencies. By exploiting the harmonic structure of voiced speech, the pitch information of one source is extracted from the pitch prediction filter and the output residual becomes the estimate of the other source. The procedure is iterated successively with a summation constraint. From the evolution of the pitch prediction filter, it is shown that iterative harmonic filtering with the summation constraint is effective in separating multiple pitch tracks into individual ones.
This paper introduces a new spectral representation-based pitch estimation method. Since pitch is never stationary during real conversations, but often undergoes changes because of intonation, the spectral representation is derived from the Short-time Harmonic Chirp Transform. This allows our technique to perform very well in noisy conditions and to extract pitch values with high confidence, even from segments with strong intonation. The paper discusses a new way of segment-wise pitch extraction and does not deal with continuous pitch tracking, which is a topic of our future work. However, the performance of the proposed method is demonstrated on real recordings and the noise dependency of its accuracy is numerically analyzed.
While there are numerous methods for estimating the fundamental frequency (F0) of speech, existing methods often suffer from pitch doubling or halving errors. Heuristics can be added to constrain the range of allowable F0 values, but it is still difficult to appropriately set the algorithm parameters if one does not know in advance the speaker's age or gender. The proposed method is distinct from most other F0-estimation algorithms in that it does not use autocorrelation, cepstral, or pattern-recognition techniques. Instead, information from 32 band-pass filters is combined at every frame, a Viterbi search provides an initial F0-contour estimate, and this estimate is then refined based on intensity discrimination of the speech signal. Despite the use of a large number of filters (which provide complementary information and hence robustness), the implementation runs in less than real time on a 2.4 GHz processor without optimization for processing speed. Results are presented for two corpora, one of an adult male and one of children of different ages. For the first corpus, the average absolute error is 4.10 Hz (percent error of 4.15%); for the second corpus, the average absolute error is 7.74 Hz (percent error of 3.38%).
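The filter-bank front end is not reconstructed here, but the Viterbi contour search can be sketched generically: given per-frame F0 candidates and scores (however obtained), dynamic programming picks one candidate per frame while penalizing large log-frequency jumps such as octave errors. Everything below (candidate format, penalty weight) is an assumption, not the paper's exact formulation.

```python
import numpy as np

def viterbi_f0(cand_freqs, cand_scores, jump_penalty=4.0):
    """cand_freqs / cand_scores: lists (one entry per frame) of equal-length
    arrays of candidate frequencies and their scores. Returns one frequency
    per frame, maximizing total score minus log-frequency transition penalties."""
    T = len(cand_freqs)
    delta = np.asarray(cand_scores[0], dtype=float)
    backpointers = []
    for t in range(1, T):
        # Penalize |log f_t - log f_{t-1}| so octave jumps are discouraged.
        trans = -jump_penalty * np.abs(
            np.log(cand_freqs[t])[None, :] - np.log(cand_freqs[t - 1])[:, None])
        scores = delta[:, None] + trans + np.asarray(cand_scores[t])[None, :]
        backpointers.append(np.argmax(scores, axis=0))
        delta = np.max(scores, axis=0)
    path = [int(np.argmax(delta))]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return np.array([cand_freqs[t][path[t]] for t in range(T)])
```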
This work proposes a method to predict the fundamental frequency and voicing of a frame of speech from its MFCC representation. This has particular use in distributed speech recognition systems, where the ability to predict fundamental frequency and voicing allows a time-domain speech signal to be reconstructed solely from the MFCC vectors. Prediction is achieved by modeling the joint density of MFCCs and fundamental frequency with a combined hidden Markov model-Gaussian mixture model (HMM-GMM) framework. Prediction results are presented on unconstrained speech using both a speaker-dependent database and a speaker-independent database. Spectrogram comparisons of the reconstructed and original speech are also made. The results show that for the speaker-dependent task the fundamental frequency prediction error is 3.1%, while for the speaker-independent task it rises to 8.3%.
This article describes an F0 curve estimation scheme based on a B-spline model. We compare this model with a more classical spline representation. The free parameters of both models are the number of knots and their locations. An optimal knot placement is proposed using a simulated annealing strategy. Experiments on real F0 curves confirm the adequacy and good performance of the B-spline approach, estimated via the least-squares error criterion.
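As a point of reference for the least-squares B-spline part (not the simulated-annealing knot optimization, which is the paper's contribution), a minimal sketch with uniformly placed knots; the knot count, spline degree, and function name are assumptions.

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

def fit_f0_bspline(times, f0, n_interior_knots=8, degree=3):
    """Least-squares cubic B-spline fit of an F0 contour with uniform interior
    knots. The paper instead optimizes the number and location of the knots
    (e.g. by simulated annealing); uniform placement is only a baseline."""
    knots = np.linspace(times[0], times[-1], n_interior_knots + 2)[1:-1]
    spline = LSQUnivariateSpline(times, f0, knots, k=degree)
    rmse = float(np.sqrt(np.mean((spline(times) - f0) ** 2)))
    return spline, rmse
```

Replacing the uniform `knots` array with positions proposed by an annealing loop, and scoring each proposal with the returned `rmse`, gives the overall structure the abstract describes.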
A time scale separation of voiced speech signals is introduced which avoids the assumption of a frequency gap between the acoustic response and the prosodic drive. The non-stationary drive is extracted self-consistently from a voice-specific subband decomposition of the speech signal. When the band-limited prosodic drive is used as the fundamental drive of a two-level drive-response model, the voiced excitation can be reconstructed as a trajectory on a generalized synchronization manifold, which is suited to serve as a cue for phoneme recognition and as a fingerprint for speaker recognition.
We propose a method to estimate the glottal flow based on the ARX model of speech production and on the LF model of glottal flow. This method splits the analysis into two stages: a low-frequency analysis to estimate the glottal source parameters, which mainly have a low-pass effect, and a second step to refine the parameters that also have a high-pass effect. Along with this new analysis scheme, we introduce a new algorithm to efficiently minimize the nonlinear function resulting from the least-squares criterion applied to the ARX model. Results on synthetic and natural speech signals prove the effectiveness of the proposed method.
In our earlier work, we have measured human intelligibility of stimuli reconstructed either from the short-time magnitude spectra or short-time phase spectra of a speech signal. We demonstrated that, even for small analysis window durations of 20-40 ms (of relevance to automatic speech recognition), the short-time phase spectrum can contribute to speech intelligibility as much as the short-time magnitude spectrum. Reconstruction was performed by overlap-addition of modified short-time segments, where each segment had either the magnitude or the phase of the corresponding original speech segment. In this paper, we employ an iterative framework for signal reconstruction. With this framework, we see that a signal can be reconstructed to within a scale factor when only phase is known, while this is not the case for magnitude. The magnitude must be accompanied by sign information (i.e., one bit of phase information) for unique reconstruction. In the absence of all magnitude information, we explore how much phase information is required for intelligible signal reconstruction. We observe that (i) intelligible signal reconstruction (albeit noisy) is possible from knowledge of only the phase sign information, and (ii) when both time and frequency derivatives of phase are known, adequate information is available for intelligible signal reconstruction. In the absence of either derivative, an unintelligible signal results.
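The iterative reconstruction framework mentioned here can be illustrated with a generic phase-only variant (in the spirit of classical STFT signal-estimation iterations): the observed short-time phase is re-imposed at every step, while the magnitude is left free to be pulled into consistency by repeated ISTFT/STFT passes. The window length, overlap, and iteration count below are assumptions, and this sketch is not the authors' exact procedure.

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct_from_phase(x, n_iter=100, nperseg=512, noverlap=384):
    """Iteratively rebuild a signal from its short-time phase only: the observed
    phase is re-imposed at every step, while the magnitude evolves through
    repeated ISTFT / STFT consistency passes."""
    x = np.asarray(x, dtype=float)
    _, _, X = stft(x, nperseg=nperseg, noverlap=noverlap)
    phase = np.angle(X)
    mag = np.ones_like(phase)                     # no magnitude knowledge at all
    for _ in range(n_iter):
        _, y = istft(mag * np.exp(1j * phase), nperseg=nperseg, noverlap=noverlap)
        y = y[:len(x)] if len(y) >= len(x) else np.pad(y, (0, len(x) - len(y)))
        _, _, Y = stft(y, nperseg=nperseg, noverlap=noverlap)
        mag = np.abs(Y)                           # keep only the implied magnitude
    _, y = istft(mag * np.exp(1j * phase), nperseg=nperseg, noverlap=noverlap)
    return y[:len(x)]                             # recovered only up to a scale factor
```

Consistent with the abstract, the result is expected to match the original signal only up to a scale factor.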
In this paper, we continue our investigation of the warped discrete cosine transform cepstrum (WDCTC), which was earlier introduced as a new speech processing feature [1]. Here, we study the statistical properties of the WDCTC and compare them with the mel-frequency cepstral coefficients (MFCC). We report some interesting properties of the WDCTC when compared to the MFCC: its statistical distribution is more Gaussian-like with lower variance, it obtains better vowel cluster separability, it forms tighter vowel clusters, and it generates better codebooks. Further, we employ the WDCTC and MFCC features in a 5-vowel recognition task using Vector Quantization (VQ) and 1-Nearest Neighbour (1-NN) as classifiers. In our experiments, the WDCTC consistently outperforms the MFCC.
Current signal processing techniques do not match the astonishing ability of the Human Auditory System in recognizing isolated vowels, particularly in the case of female or child speech. As didactic and clinical interactive applications that use sound as the main medium of interaction are needed, new signal features must be used that capture important perceptual cues more effectively than popular features such as formants. In this paper we propose the new concept of Perceptual Spectral Cluster (PSC) and describe its implementation. Test results are presented for child and adult speech, and indicate that features elicited by the PSC concept permit reliable and robust identification of vowels, even at high pitches.
The two distinct sound sources comprising voiced frication, voicing and frication, interact. One effect is that the periodic source at the glottis modulates the amplitude of the frication source originating in the vocal tract above the constriction. Voicing strength and modulation depth for frication noise were measured for sustained English voiced fricatives using high-pass filtering, spectral analysis in the modulation (envelope) domain, and a variable pitch compensation procedure. Results show a positive relationship between strength of the glottal source and modulation depth at voicing strengths below 66 dB SPL, at which point the modulation index was approximately 0.5 and saturation occurred. The alveolar [z] was found to be more modulated than other fricatives.
We propose a novel approach to the design of efficient representations of speech for various recognition tasks. Using a principled information theoretic framework - the Information Bottleneck method - which enables quantization that preserves relevant information, we demonstrate that significantly smaller representations of the signal can be obtained that still capture most of the relevant information about phonemes or speakers. The significant implications for building more efficient speech and speaker recognition systems are discussed.
The so-called Long-Term (LT) modeling of sinusoidal parameters, proposed in previous papers, consists in modeling the entire time trajectory of amplitude and phase parameters over large sections of voiced speech, differing from the usual Short-Term models, which are defined on a frame-by-frame basis. In the present paper, we focus on a specific novel contribution to this general framework: the comparison of four different Long-Term models, namely a polynomial model, a model based on discrete cosine functions, and combinations of discrete cosine functions with sine functions or polynomials. Their performances are compared in terms of synthesis signal quality, data compression, and modeling accuracy, and the relevance of the presented study to speech coding is shown.
A new speech representation based on multiple filtering of the temporal trajectories of speech energies in frequency sub-bands is proposed and tested. The technique extends earlier work on delta features and RASTA filtering by processing the temporal trajectories with a bank of band-pass filters with varying resolutions. In initial tests on the OGI Digits database the technique yields about 30% relative improvement in word error rate over conventional PLP features. Since the applied filters have zero-mean impulse responses, the technique is inherently robust to linear distortions.
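A minimal sketch of the general idea, filtering each sub-band log-energy trajectory with a small bank of modulation-frequency band-pass filters and stacking the outputs; the filter type, order, and band edges below are illustrative assumptions, not the paper's actual filter bank.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def multires_trajectory_features(log_energies, frame_rate=100.0,
                                 bands=((0.5, 2.0), (2.0, 8.0), (8.0, 16.0))):
    """Band-pass filter the temporal trajectory of each sub-band log-energy
    (log_energies has shape (n_frames, n_subbands)) with filters of different
    modulation-frequency resolutions and stack the outputs as features.
    Band-pass filters have zero-mean impulse responses, hence the robustness
    to linear (convolutive) distortions mentioned in the abstract."""
    outputs = []
    for lo, hi in bands:
        b, a = butter(2, [lo / (frame_rate / 2), hi / (frame_rate / 2)], btype='band')
        outputs.append(filtfilt(b, a, log_energies, axis=0))   # zero-phase filtering
    return np.concatenate(outputs, axis=1)
```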
We present a novel method of dimension reduction and feature selection that makes use of category-dependent regions in high-dimensional data. Our method is inspired by the phoneme-dependent, noise-robust, low-variance regions observed in the cortical response, and introduces the notion of category dependence in a two-step dimension reduction process that draws on the fundamental principles of Fisher Linear Discriminant Analysis. As a method of applying these features in an actual pattern classification task, we construct a system of multiple speech recognizers that are combined by a Bayesian decision rule under some simplifying assumptions. The results show a significant increase in recognition rate for low signal-to-noise ratios compared with previous methods, providing motivation for further study on hierarchical, category-dependent recognition and detection.
Accurate speech activity detection is a challenging problem in the car environment, where high background noise and high-amplitude transient sounds are common. We investigate a number of features that are designed to capture the harmonic structure of speech. We evaluate separately three important characteristics of these features: 1) discriminative power, 2) robustness to greatly varying SNR and channel characteristics, and 3) performance when used in conjunction with MFCC features. We propose a new feature, the Windowed Autocorrelation Lag Energy (WALE), which has these desirable properties.
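The abstract does not define WALE, so the sketch below is only one plausible reading of a "windowed autocorrelation lag energy": the energy of the normalized autocorrelation restricted to the window of lags covering plausible pitch periods, which is large for harmonic (voiced) frames and small for noise. Treat the definition, pitch range, and normalization as assumptions.

```python
import numpy as np

def wale(frame, fs, fmin=80.0, fmax=400.0):
    """Hedged sketch of a windowed-autocorrelation-lag-energy style feature:
    energy of the normalized autocorrelation over the lag window that
    corresponds to plausible pitch periods. The paper's exact WALE
    definition may differ."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - np.mean(frame)
    n = len(frame)
    nfft = 1 << (2 * n - 1).bit_length()
    X = np.fft.rfft(frame, nfft)
    acf = np.fft.irfft(X * np.conj(X), nfft)[:n]
    acf = acf / (acf[0] + 1e-12)                  # normalize so acf[0] == 1
    lo, hi = int(fs / fmax), int(fs / fmin)
    return float(np.sum(acf[lo:hi + 1] ** 2))
```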
Voice pathology detection and classification is a special research field in voice and speech processing because of its deep social impact [8]. Historically, in the development of new tools for pathology detection, different sets of distortion parameters have been defined: on the one hand, those estimating perturbations of certain acoustic voice features such as pitch or energy; on the other, those estimating the dispersion of the spectral density of voice. Among the latter, parameters based on the estimation of the residual of the glottal source related to the mucosal wave are of special interest, as they can be shown to be related to the biomechanical behavior of the vocal cords [2][3]. The number of parameters available for pathology detection and classification is therefore rather high, although not all of them have the same relevance, depending on the specific objective to be covered. The present work aims to assess the relevance of different parameters for voice pathology detection using pruning techniques based on Principal Component Analysis. Specific experiments are used to stress the relevance of the parameters in differentiating normophonic and pathologic cases. Possible applications of the method to classification among pathologies could be derived from this study.
We propose a methodology for speech segmentation in which the LSF feature vector matrix of a segment is reconstructed optimally using a set of parametric/non-parametric functions. We have explored approximations using basis functions or polynomials. We have analyzed the performance of these methods w.r.t. phoneme segmentation (on 100 TIMIT sentences) and reconstruction error based on the spectral distortion (SD) measure. We study how amenable these methods are to quantization and their suitability for speech coding. We also estimate the optimum number of segments depending on the reconstruction performance achieved using that number of segments and the tolerance limit set on the spectral distortion error.
In the development of a syllable-centric Automatic Speech Recognition (ASR) system, segmentation of the speech signal into syllabic units is an important stage. In [1], an implicit algorithm is presented for segmenting the continuous speech signal into syllable-like units, in which the orthographic transcription is not used. In the present study, a new explicit segmentation algorithm is proposed and analyzed that uses the orthographic transcription of the given continuous speech signal. The advantage of using the transcription during segmentation is that the number of syllable segments present in the speech signal can be known a priori. Although the short-term energy (STE) function contains useful information about syllable segment boundaries, it cannot be directly used to perform segmentation due to significant local energy fluctuations. In the present work, an Auto-Regressive model-based algorithm is presented which essentially smooths the STE function using the knowledge of the number of syllable segments required/present in the given speech signal. Experiments carried out on the TIMIT speech corpus show that the error in segmentation is at most 40 ms for 87.84% of the syllable segments.
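As a rough illustration of the transcription-constrained boundary picking described above, the sketch below smooths the short-term energy contour and keeps the required number of deepest valleys as syllable boundaries. The paper uses an auto-regressive-model-based smoother; a moving average stands in for it here, and all parameter values are assumptions.

```python
import numpy as np
from scipy.signal import argrelmin

def syllable_boundaries(ste, n_syllables, smooth_len=15):
    """Smooth the short-term energy (STE) contour and return the indices of the
    (n_syllables - 1) deepest local minima as candidate syllable boundaries.
    n_syllables is known a priori from the orthographic transcription."""
    kernel = np.ones(smooth_len) / smooth_len
    smooth = np.convolve(ste, kernel, mode='same')       # stand-in for AR smoothing
    minima = argrelmin(smooth, order=3)[0]               # local valleys
    keep = minima[np.argsort(smooth[minima])][:n_syllables - 1]
    return np.sort(keep)
```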
An extension of the conventional speaker segmentation framework is presented for a scenario in which a number of microphones record the activity of the speakers present at a meeting (one microphone per speaker). Although each microphone can receive speech from both the participant wearing the microphone (local speech) and other participants (cross-talk), the recorded audio can be broadly classified into three classes: local speech, cross-talk, and silence. This paper proposes a technique which takes cross-correlations, the values of their maxima, and energy differences into account as features to identify and segment speaker turns. In particular, we have used classical cross-correlation functions, time smoothing, and in part temporal constraints to sharpen and disambiguate timing differences between microphone channels that may be dominated by noise and reverberation. Experimental results show that the proposed technique can be successfully used for speaker segmentation of data collected from a number of different setups.
A vector autoregressive (VAR) model is used in the auditory time-frequency domain to predict spectral changes. The forward and backward prediction errors increase at phone boundaries. These error signals are then used to study and detect the boundaries with the largest changes, allowing the most reliable automatic segmentation. Using a fully unsupervised method yields segments consisting of a variable number of phones. The performance of the method was tested with a set of 150 Finnish sentences pronounced by one female and two male speakers. The performance for English was tested using the TIMIT core test set. The boundaries between stops and vowels, in particular, are detected with high probability and precision.
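A minimal sketch of the forward half of this idea, under assumptions not in the abstract (a single global VAR model fitted by least squares on generic frame features): the norm of the one-step forward prediction error is computed per frame, and its peaks are taken as boundary candidates.

```python
import numpy as np

def var_prediction_error(feats, order=2):
    """Fit a vector autoregressive model of the given order to a feature
    sequence feats of shape (n_frames, dim) by least squares and return the
    norm of the forward prediction error at each frame; peaks suggest
    phone boundaries."""
    T, d = feats.shape
    # Regression design: predict frame t from frames t-1 .. t-order.
    X = np.hstack([feats[order - k - 1:T - k - 1] for k in range(order)])
    Y = feats[order:]
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)
    residual = Y - X @ A
    error = np.zeros(T)
    error[order:] = np.linalg.norm(residual, axis=1)
    return error
```

A backward error can be obtained the same way on the time-reversed feature sequence, and the two error signals combined before peak picking.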
Speakers with a defective velopharyngeal mechanism produce speech with inappropriate nasal resonances across vowel sounds. Acoustic analysis of hypernasal speech and of nasalized vowels in normal speech shows that an additional frequency is introduced in the low-frequency region close to the first formant frequency [1]. Conventional formant extraction techniques may fail to resolve closely spaced formants. In this paper, an attempt is made to use the group delay based algorithm [2] for the extraction of formant frequencies from hypernasal speech. Preliminary experiments on a synthetic signal with closely spaced formants show that the formants are better resolved in the group delay spectrum than with conventional methods. However, when the formants are too close and have wider bandwidths, the group delay algorithm also fails to resolve them prominently. This is primarily because of the influence of the other resonances in the signal. To extract the additional frequency close to the first formant, the speech signal is low-pass filtered and the formants are extracted using the group delay function. Following the satisfactory results on the synthetic signal, the above technique is used to extract formants from the phonations /a/, /i/, and /u/ uttered by 15 speakers with cleft palate who are expected to produce hypernasal speech. Invariably, in all the tests, an additional nasal resonance around 250 Hz and the first formant frequency of the vowels are resolved properly.
In this paper we propose a novel method for the detection of relevant changes in a continuous acoustic stream. The aim is to identify the optimal number and positions of the change-points that split the signal into shorter, more or less homogeneous sections. First we describe the theory used to derive the segmentation algorithm. Then we show how this algorithm can be implemented efficiently. Evaluation is done on broadcast news data with the goal of segmenting it into parts belonging to different speakers. In simulated tests with artificially mixed utterances the algorithm identified 97.1% of all speaker changes with a precision of 96.5%. In tests on 30 hours of real broadcast news (in 9 languages) the average recall was 80% and the precision 72.3%.
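The abstract does not spell out its change-point criterion, so the sketch below uses the common Gaussian/BIC formulation for scoring a candidate speaker change within a feature block; the paper's actual criterion, penalty, and search strategy may differ.

```python
import numpy as np

def delta_bic(features, t, penalty=1.0):
    """BIC-style change-point score at frame t for a feature block of shape
    (n_frames, dim): compare one full-covariance Gaussian for the whole block
    against two Gaussians for the halves split at t. Positive values favour a
    change. t should be well away from the block edges so both covariance
    estimates are meaningful."""
    def logdet_cov(x):
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
        return np.linalg.slogdet(cov)[1]
    n, d = features.shape
    full = n * logdet_cov(features)
    left = t * logdet_cov(features[:t])
    right = (n - t) * logdet_cov(features[t:])
    n_params = d + d * (d + 1) / 2               # mean + full covariance
    return 0.5 * (full - left - right) - 0.5 * penalty * n_params * np.log(n)
```

Scanning `t` over a sliding window and accepting the maximum whenever the score is positive is the usual way such a criterion is turned into a segmentation pass.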