Based on a comparison between 4 esophageal speakers and 4 normal laryngeal speakers, this study investigated voice onset time (VOT) characteristics and linguopalatal articulation in the production of Mandarin obstruent consonants. Results show that esophageal speakers distinguish unaspirated from aspirated plosives and affricates in much the same way as laryngeal speakers do. However, in esophageal speech the aspirated plosives and affricates have a shorter VOT, and the unaspirated plosives and affricates a longer VOT, than in laryngeal speech. Interestingly, esophageal speech exhibits significantly more extensive linguopalatal contact than normal speech does. The results suggest that articulatory strategies are adjusted to facilitate linguopalatal articulation as well as sub-to-supra-laryngeal coordination by using a narrower airway in the production of esophageal speech.
The aim of this acoustic study is to examine place of articulation during the production of voiced plosives by children with a cleft palate. The data are compared with those obtained from control children without speech disorders. F2 and F3 values are measured at the burst release of the plosives. Analyses are carried out for 52 children aged 9 to 18 years, divided into two groups: 26 children with a cleft palate and 26 children with no speech disorder. Results reveal differences in F2/F3 values between disordered and unimpaired children, and also between the different age groups, for both disordered and control children. Differences in articulatory strategies are inferred from the analyses of F2/F3 relations.
In this paper, automatic dysarthria severity classification is explored as a tool to advance objective intelligibility prediction of spastic dysarthric speech. A Mahalanobis-distance-based discriminant analysis classifier is developed from a set of acoustic features previously proposed for intelligibility prediction and voice pathology assessment. Feature selection is used to identify salient features for both the severity classification and intelligibility prediction tasks. Experimental results show that a two-level severity classifier combined with a 9-dimensional intelligibility prediction mapping achieves a correlation of 0.92 and a root-mean-square error of 12.52 with subjective intelligibility ratings. The effects of classification errors on intelligibility prediction accuracy are also explored and shown to be insignificant.
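The abstract does not include the classifier itself; as an illustration, a Mahalanobis-distance discriminant classifier of the general kind described (per-class means, a pooled covariance, nearest-class-distance decision) can be sketched as below. All data, dimensions, and class labels here are synthetic, not the paper's features:

```python
import numpy as np

def fit_classes(X, y):
    """Estimate per-class means and a shared (pooled) covariance,
    as used in Mahalanobis-distance discriminant analysis."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    # Pooled within-class covariance from class-centered data
    centered = np.vstack([X[y == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False)
    return means, np.linalg.inv(cov)

def classify(x, means, cov_inv):
    """Assign x to the class with the smallest Mahalanobis distance."""
    def d2(mu):
        diff = x - mu
        return float(diff @ cov_inv @ diff)
    return min(means, key=lambda c: d2(means[c]))

# Toy example: two well-separated "severity" classes in 3 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
means, cov_inv = fit_classes(X, y)
print(classify(np.array([4.8, 5.1, 5.0]), means, cov_inv))  # → 1
```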
The empirical mode decomposition (EMD) algorithm is proposed as an alternative way to decompose the log-magnitude spectrum of the speech signal into its harmonic, envelope and noise components, and the harmonic-to-noise ratio is used to summarize the degree of disturbance in the speech signal. EMD is a tool for the analysis of multi-component signals that, unlike conventional analysis methods (e.g. the Fourier and wavelet transforms), does not require an a priori fixed basis function. The proposed method is tested on synthetic vowels and natural speech. The corpus of synthetic vowels comprises 48 stimuli of the synthetic sound [a] combining three values of vocal frequency, four levels of jitter frequency and four levels of additive noise. The corpora of natural speech comprise a concatenation of the vowel [a] with two Dutch sentences produced by 28 normophonic speakers and 223 speakers with different degrees of dysphonia.
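A full EMD implementation is beyond the scope of an abstract, but the harmonic-to-noise ratio that the method summarizes can be illustrated with a simple autocorrelation-based estimate. This is a generic sketch on synthetic signals, not the authors' EMD-based decomposition:

```python
import numpy as np

def hnr_db(signal, fs, f0_min=75.0, f0_max=400.0):
    """Crude harmonic-to-noise ratio from the normalized autocorrelation
    peak in the plausible pitch-lag range (not the paper's EMD method)."""
    x = signal - np.mean(signal)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags
    ac = ac / ac[0]                                    # normalize: ac[0] == 1
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    r = float(np.max(ac[lo:hi]))                       # periodicity strength
    r = min(max(r, 1e-6), 1 - 1e-6)                    # keep log10 finite
    return 10.0 * np.log10(r / (1.0 - r))

fs = 16000
t = np.arange(int(0.1 * fs)) / fs
clean = np.sin(2 * np.pi * 150 * t)                    # periodic "vowel"
noisy = clean + 0.5 * np.random.default_rng(1).normal(size=t.size)
print(hnr_db(clean, fs) > hnr_db(noisy, fs))  # → True
```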
The Artificial Larynx Transducer (ALT), a means of regaining audible speech for people who have undergone a total laryngectomy, has been known for decades. Neither its design and underlying technique nor its poor speech quality and intelligibility have improved since. In a world where technology rules daily life, the available technology should be used to improve the quality of life of people with disabilities. One reason for the lack of naturalness is the constant vibration of the ALT. A method to substantially improve ALT speech is to introduce a varying fundamental frequency (F0) contour. In this paper we present a new method to automatically learn an artificial F0 contour. The model used is a Gaussian mixture model (GMM) trained on a database containing speech of ALT users as well as healthy speakers. Informal listening tests suggest that this approach is a first step toward a subsequent overall enhancement technique for speech produced by an ALT.
Speech sound disorders (SSD) are the most common communication impairment in childhood, and can unfortunately hamper social development and learning. Current speech therapy interventions must rely predominantly on the auditory skills of the child, as little technology is available to assist in diagnosis and therapy of SSDs. Real-time visualisation of tongue movements would bring enormous benefit. An ultrasound scanner offers this possibility, though its display has certain limitations which may make it hard to interpret. Our ultimate goal is to address these deficiencies: to exploit ultrasound to track tongue movement, but to display a simplified, diagrammatic vocal tract that is easier to interpret. In this paper, we first outline our general approach to this problem, which combines a latent space model with a dimensionality-reducing model of vocal tract shapes. Then, we present pilot work to assess the feasibility of this approach. Specifically, we use MRI scans to train a model of vocal tract shapes, then attempt to animate that model using electromagnetic articulography (EMA) data from the same speaker. Piloting with EMA data is an intermediate step: it is simpler than using ultrasound, but still provides valuable insight. Based on these initial experiments, we argue the approach is promising.
We present a large-scale study on classification of linguistic and non-linguistic vocalizations including laughter, vocal noise, hesitation and consent on four corpora amounting to 46 hours of spontaneous conversational speech. We consider training and testing on speaker-independent subsets of single corpora (intra-corpus) as well as inter-corpus experiments where models built on one or more corpora are evaluated on a disjoint corpus. Our results reveal that while inter-corpus performance is considerably lower than comparable intra-corpus results, this effect can be countered by data agglomeration; furthermore, we observe that inter-corpus classification accuracies indicate suitability of corpora for building generalizing models.
This study examines the accentual and phrasing properties of a variety of L2 French commonly called “Français Fédéral”, spoken in Switzerland by speakers whose mother tongue is a Swiss-German dialect. To this end, we compared data from 4 groups of 4 speakers: 2 groups of native French speakers from Neuchâtel and Paris, and 2 groups of Swiss-German French speakers from Bern and Zürich. The data are semi-automatically processed, and three main prosodic features relating to accentuation and phrasing are examined: prominence distribution and metrical weight of the Phonological Phrase, respect of Phonological Phrase formation constraints (Align-XP and No-clash), and realization of sandhi phenomena within and across Phonological Phrase boundaries. Our findings suggest that “Français Fédéral” shares more features with a lexical accentuation system than with a supra-lexical one.
Young multi-ethnolectal speakers of Hamburg German have introduced an alternation of /ç/ to [ʃ] following the lax front vowel /ɪ/. We conducted perception studies exploiting this contrast in Berlin (Germany), a city with large multi-ethnic neighborhoods. The alternation is pervasive and noticeable; it is mocked and stigmatized, and there is an awareness that many young speakers (including ethnic Germans) from neighborhoods with larger migrant populations, like Kreuzberg (KB), substitute /ç/ with /ʃ/, while speakers from less stigmatized vicinities, like Zehlendorf (ZD), do not. The categorization of items on two 14-step synthesized continua from "Fichte" 'spruce' to "fischte" '3rd p. sg. of to fish' by 99 listeners shows that the interpretation of fine phonetic detail is strongly influenced by the co-presentation of the label KB or ZD, in contrast to no label (control). Analyses of reaction times (RTs) show that significantly more time is needed to process stimuli labeled KB, and less for stimuli labeled ZD. Moreover, younger listeners (below 30 years) perceive more /ʃ/ variants than older listeners. Phonological generalization over phonetic input thus depends on associative information: perceptual divergence is found within the confines of a single large urban area.
The background of the present work is the development of a tele-operation system in which the lip motion of a remote humanoid robot is automatically controlled from the operator's voice. In the present paper, we introduce an improved version of our proposed speech-driven lip motion generation method, in which degrees of lip height and width are estimated from vowel formant information. The method requires the calibration of only one parameter for speaker normalization, so no training of dedicated models is necessary. Lip height control is evaluated in a female android robot, Geminoid-F, and in an animated face. Subjective evaluation indicated that the naturalness of the lip motion generated in the robot is improved by the inclusion of partial lip width control (with stretching of the lip corners). The highest naturalness scores were achieved for the animated face, showing the effectiveness of the proposed method.
We explore the use of two spectral measures calculated in ERB space for differentiating between the frication noise of sibilants in Mandarin Chinese and Korean. The peak frequency (peakERB) of the spectral representation was used to capture differences in front cavity size, and a compactness index (CI) was used to capture the bandwidth of the peak. In both /a/ and /i/ vowel contexts the peakERB measure differentiated between Mandarin [s], [ɕ], and [ʂ], and also between Korean [sʰ] and [s*], which, although considered to be articulated at the same place, differ in front cavity size due to the tighter lingual constriction of [s*]. The CI measure helped further differentiate the fricatives, with Mandarin [ɕ] having a broader peak (higher CI) than [s] or [ʂ], and Korean [sʰa] having a broader peak than [s*a]. When applied to the L2 Korean productions of L1 Mandarin speakers, we found evidence for both Korean fricatives assimilating to Mandarin [s] before /a/ and to Mandarin [ɕ] before /i/.
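The ERB-rate conversion underlying a peakERB-style measure follows the Glasberg & Moore (1990) formula; below is a minimal sketch of locating a spectral peak and expressing it in ERB units. The test signals and the 500 Hz low-frequency cutoff are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

def hz_to_erb_rate(f_hz):
    """Glasberg & Moore (1990) ERB-rate scale."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

def peak_erb(signal, fs):
    """Frequency of the spectral peak, expressed on the ERB-rate scale
    (a simplified stand-in for the paper's peakERB measure)."""
    spec = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    mask = freqs > 500.0          # ignore DC and very low frequencies
    f_peak = freqs[mask][np.argmax(spec[mask])]
    return hz_to_erb_rate(f_peak)

fs = 16000
t = np.arange(2048) / fs
# Stand-ins for frication with different peak frequencies
low_peak = np.sin(2 * np.pi * 3000 * t)    # e.g. larger front cavity
high_peak = np.sin(2 * np.pi * 7000 * t)   # e.g. smaller front cavity
print(peak_erb(high_peak, fs) > peak_erb(low_peak, fs))  # → True
```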
The study examines interactions associated with foreigner-directed speech (FDS) within an EFL triadic conversation exam in which learners differ in outspokenness. The results showed that learners' outspokenness affected the foreign examiners' vowel articulation. In F2, more vowels underwent distinct change across learners' positively coded utterances, whereas in F1 the vowel change in the high-communicative group during the introduction of a new question was the greatest source of variation. The findings suggest that native speakers tend to hyperarticulate during interactions with the low-communicative group, probably with didactic intent to instruct or to smooth communicative situations, in particular during more encouraging speech acts.
We conducted a perception study on Mandarin, a tone language where pitch carries contrastive information, to investigate whether pitch changes can override spectral information in determining the number of syllables in an utterance. We generated F0 contours and simulated tonal coarticulation using the qTA model. The perception of syllable number depended on the perception of tones, and this effect held across speech rates. Combined with prior work, the results indicate that laryngeal and supralaryngeal events interact in syllable perception in tone languages. We discuss how our findings support the notion of language-specific perception.
Automatic forced alignment of transcriptions with audio has achieved high levels of agreement for languages with large corpora, and the technique holds great promise for work on all languages. Here, we apply two forced alignment programs to data from an endangered Mixtecan language of Mexico. Both placed a majority of boundaries within 20 ms of hand-labeled ones. Phonemes with fairly steady-state elements (e.g. nasals, fricatives) were more accurately labeled than others. Forced alignment thus may increase the efficiency of labeling texts from smaller languages, at least where the phoneme inventories are similar to those of the training languages.
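The 20 ms agreement criterion reported here is simple to compute; a minimal sketch with hypothetical boundary times (not the paper's data):

```python
def boundary_agreement(auto_ms, manual_ms, tol_ms=20.0):
    """Fraction of automatically placed boundaries that fall within
    tol_ms of the nearest hand-labeled boundary."""
    hits = sum(
        1 for a in auto_ms
        if min(abs(a - m) for m in manual_ms) <= tol_ms
    )
    return hits / len(auto_ms)

# Hypothetical boundary times (ms) from an aligner vs. hand labels
auto = [102.0, 250.0, 431.0, 600.0]
manual = [95.0, 255.0, 400.0, 598.0]
print(boundary_agreement(auto, manual))  # → 0.75
```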
We examined glottal opening patterns in devoicing environments, with respect to the factors that facilitate or suppress devoicing. The results indicated that glottal opening patterns are twofold: a single-phase and a double-phase opening for /CVC/. For a Tokyo speaker, only single-phase openings appeared in typical consonantal environments; gestural reorganization is assumed in these cases. Double-phase openings appeared in an atypical consonantal environment for the Tokyo speaker. For a speaker of the Osaka dialect, in which devoicing is less frequent, double-phase openings appeared regardless of whether the consonantal environment was typical or atypical. The effects of atypical consonantal environments and dialect on devoicing are attributed to glottal gesture overlap. In faster speech, the two phases of a double-phase opening tend to merge, which facilitates devoicing. In consecutive devoicing environments, the vowel in a typical consonantal environment is the first candidate, with or without devoicing of the following vowel; the following vowel can be devoiced if the glottal opening for the preceding /CVC/ is reorganized. For phrase-final /u/, both single- and double-phase openings appeared for Tokyo speakers, and greater interspeaker variation in devoicing appeared in phrase-final position.
This study examines the effects of lexical stress on vowel quality in Iberian Spanish and Central Catalan and the relationship between phonetic variation caused by stress and speech rate. For Catalan, both absence of stress and fast speech rate result in a shrunken vowel space. For Spanish, the effects of speech rate are in line with those found for Catalan. Yet, unstressed vowels are characterized by lower F1 than their stressed counterparts. Results are discussed in the light of the Hyperarticulation Hypothesis and the Sonority Expansion Hypothesis.
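A shrunken vowel space of the kind reported here is commonly quantified as the convex-hull area of tokens in the F1-F2 plane; a minimal sketch with hypothetical formant values (not data from this study):

```python
import numpy as np
from scipy.spatial import ConvexHull

def vowel_space_area(f1, f2):
    """Area of the convex hull of vowel tokens in the F1-F2 plane,
    a common index of vowel space expansion/reduction."""
    pts = np.column_stack([f2, f1])
    return ConvexHull(pts).volume  # for 2-D points, .volume is the area

# Hypothetical corner-vowel formants (Hz): stressed vs. unstressed tokens
stressed   = {"f1": [750, 300, 280, 450], "f2": [1400, 2300, 900, 1100]}
unstressed = {"f1": [650, 380, 360, 470], "f2": [1450, 2000, 1100, 1200]}
a_s = vowel_space_area(stressed["f1"], stressed["f2"])
a_u = vowel_space_area(unstressed["f1"], unstressed["f2"])
print(a_s > a_u)  # → True (unstressed space is shrunken)
```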
Research on vowel space dispersion has found that vowel spaces are less dispersed when lexical and contextual factors favor identification of a word (e.g., a highly predictable word), and vowel spaces are more dispersed when lexical and contextual factors create an environment for less predictability. Examining sound changes in progress, recent work has demonstrated that, for some vowel changes, a vowel will be produced in a more innovative way when the lexical item in which it occurs is highly semantically predictable. The data for such claims have been collected in laboratory settings. In this paper we investigate whether these findings from the laboratory extend into more naturalistic settings, and we examine the precise character of the reduction by examining dynamic movement across a vowel's duration.
Pronunciation patterns across dialect regions in the United States are changing. This paper examines acoustic differences between the vowels of older and middle-aged adults produced in spontaneous speech. The new developments in regional vowel systems which differentiate American English dialects, found recently in citation-form vowels and in read speech, were confirmed in the present analysis. While providing additional evidence for sound change on the basis of spontaneous speech data, this work also brings to light important challenges facing researchers using spontaneous speech in laboratory analyses, such as the loss of fine experimental control in the examination of spectral dynamics and limits on the usefulness of statistical assessment.
Numerous studies have examined the properties of “clear” speech, i.e. speech produced in the context of real or imagined communicative difficulties. In general, clear speech is characterized by hyperarticulation. However, the effect of clear speech on coarticulation varies: simulated clear-speech contexts show less nasal coarticulation, while speech directed toward a real listener shows more. Additionally, both hyperarticulation and nasal coarticulation vary with phonological neighborhood density (ND): words from dense neighborhoods show a greater degree of both, relative to words from sparse neighborhoods. This study examines the perceptual consequences of these production effects by means of a lexical decision task with speech from two different “clear” conditions. The findings indicate that speech directed at a real listener is processed faster than simulated clear speech. Further, high-ND words (with hyperarticulation and increased coarticulation, as in real-listener-directed speech) were responded to faster than low-ND words.
This paper investigates the effect of vowel quality on the perception of coda nasals in Southern Min. The perceptual confusion experiment revealed that /m/ is the most confusable coda nasal, followed by /ŋ/ and then /n/. The high front vowel /i/ resulted in more misidentification of following coda nasals than mid vowel /ə/ and low vowel /a/. Within the same vowel context, higher formant frequency at the juncture of vowels and nasals and greater formant change from the midpoint to the endpoint of vowels provided more salient acoustic cues to place of articulation of post-vocalic nasals and thus resulted in higher accuracy of coda nasal identification.
The alternation between voiced plosives and spirants in Iberian languages is usually described as a complementary distribution between two allophones. The present study explores the acoustics of /d/ in two corpora of spontaneous speech and examines the hypothesis that constriction degree in /d/ is governed by finer-grained speech production factors than previously claimed. Three acoustic metrics were developed as indexes of articulatory weakening. The findings suggest that variation in the implementation of /d/ results from gradient modulation of constriction degree on a unimodal, rather than bimodal, statistical-acoustic distribution. The preceding segment is a strong predictor of the weakening (i.e., spirantization) of Catalan and Spanish /d/.
The article introduces a novel approach to noise reduction based on approximating the distance of signal sources, achieved by estimating the specific acoustic impedance of the incoming signal. The method uses small two-microphone arrangements, so-called two-channel microphones, which barely exceed the spatial extent of a single microphone. The novel adaptive block-online algorithm significantly reduces the level of distant signals that have the same angle of arrival as the closer source signal. The method has been tested in an anechoic chamber and in a more realistic room with a 250 ms reverberation time, with sources located at distances of 5 cm and 1 m from the test microphone. The experiment shows a level reduction of 5.1 dB for a distant noise source, while the signal level of the close-to-microphone speaker is not affected. The approach targets new applications but also provides algorithmic improvements for existing ones, e.g. noise reduction in hand-held microphones or distance-dependent noise gates.
In this paper, we present a novel DOA estimation method for human speech using subband weighting. Existing DOA estimation methods still cannot perform reliably in low-SNR conditions; to improve robustness in noisy environments, we propose the following approach. First, the speech signal of each channel is passed through a Gammatone filterbank to obtain a set of time-domain subband signals. Second, TDOA estimation is performed in each subband based on a new cost function, and a subband weight is calculated to emphasize the estimates from subbands likely to contain speech. Finally, the DOA is determined from the estimated TDOA and the geometry of the microphone array. Experimental results show that the proposed subband-weighting-based method outperforms SRP-PHAT and the broadband MUSIC algorithm in highly noisy environments.
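For the final step, mapping a TDOA to a DOA via array geometry, a standard two-microphone baseline (GCC-PHAT plus the far-field arcsine mapping, not the paper's subband-weighted estimator) can be sketched as:

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs):
    """TDOA (s) between two channels via GCC-PHAT; a positive value
    means channel 2 lags channel 1."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12        # PHAT weighting
    cc = np.fft.irfft(cross, n)
    max_lag = n // 2
    cc = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])
    lag = np.argmax(np.abs(cc)) - max_lag
    return lag / fs

def tdoa_to_doa(tdoa, mic_dist, c=343.0):
    """Map a TDOA to a broadside DOA angle (degrees) for a two-mic array."""
    return np.degrees(np.arcsin(np.clip(c * tdoa / mic_dist, -1.0, 1.0)))

fs, delay_samples = 16000, 4
rng = np.random.default_rng(0)
s = rng.normal(size=4096)
x1 = s
x2 = np.concatenate([np.zeros(delay_samples), s[:-delay_samples]])
tdoa = gcc_phat_tdoa(x1, x2, fs)
print(round(tdoa * fs))  # → 4
```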
Binaural noise-reduction techniques based on the multichannel Wiener filter (MWF) have been reported as promising candidates for binaural hearing aids because of their efficient noise reduction for any direction of arrival of the target signal and their preservation of localization cues. Implementations of these techniques have been reported for FFT-based and wavelet-packet-based (WP) processing. In these implementations, the processing delay is large, bounded from below by the block processing inherent in the FFT and WP computation. This paper proposes a different implementation of the MWF by means of frequency-warped FIR filters, providing performance close to a WP-based MWF with a smaller processing delay.
It is well known that human conversational speech is “sparse” in the time domain, comprising many “off” segments. This suggests exploiting the “off”-time nature of speech for speech enhancement. We propose an efficient dual-microphone method based on regularized cross-channel cancellation to distinguish overlapping from single-speaker segments in multi-speaker conversational environments. Conveniently, the regularized cancellation results can be reused for speech enhancement along an interference-suppression chain. We present evaluations of the proposed overlapping speech detection and integrated speech enhancement approaches using an IEEE speech database and real room recordings under various acoustic conditions, showing promising improvements for speech enhancement by exploiting the “off”-time nature of speech.
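Regularized cross-channel cancellation of the general kind described can be sketched as a regularized least-squares FIR fit from one channel to the other: a single source is cancelled well (small residual), while overlapping sources leave a large residual. This is a simplified illustration on synthetic signals, not the authors' algorithm:

```python
import numpy as np

def cancellation_residual(x1, x2, order=32, reg=1e-3):
    """Fit a short FIR filter mapping channel 1 to channel 2 by
    regularized least squares, and return the normalized residual
    energy after cancelling channel 1 from channel 2."""
    N = len(x1) - order
    # Lagged-signal (convolution) matrix of channel 1
    A = np.stack([x1[i:i + N] for i in range(order)][::-1], axis=1)
    b = x2[order:order + N]
    h = np.linalg.solve(A.T @ A + reg * np.eye(order), A.T @ b)
    resid = b - A @ h
    return float(np.sum(resid**2) / np.sum(b**2))

rng = np.random.default_rng(0)
s1 = rng.normal(size=8000)
s2 = rng.normal(size=8000)
# Single source: channel 2 is a delayed, scaled copy of channel 1
single = cancellation_residual(s1, 0.8 * np.roll(s1, 3))
# Overlap: an independent second source also reaches channel 2
overlap = cancellation_residual(s1, 0.8 * np.roll(s1, 3) + s2)
print(single < overlap)  # → True
```

Thresholding this residual ratio per frame gives a simple overlapping-speech detector of the kind the abstract describes.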