INTERSPEECH.2019 - Others

Total: 373

#1 Individual Variation in Cognitive Processing Style Predicts Differences in Phonetic Imitation of Device and Human Voices [PDF]

Authors: Cathryn Snyder ; Michelle Cohn ; Georgia Zellou

Phonetic imitation, or implicitly matching the acoustic-phonetic patterns of another speaker, has been empirically associated with natural tendencies to promote successful social communication, as well as individual differences in personality and cognitive processing style. The present study explores whether individual differences in cognitive processing style, as indexed by self-reported scores from the Autism-Spectrum Quotient (AQ) questionnaire, are linked to the way people imitate the vocal productions of two digital device voices (i.e., Apple’s Siri) and two human voices. Subjects first performed a word shadowing task of human and device voices and then completed the self-administered AQ. We assessed imitation of two acoustic properties: f0 and vowel duration. We find that the attention-to-detail and imagination subscale scores on the AQ mediated the degree of imitation of f0 and vowel duration, respectively. The findings yield new insight into speech production and perception mechanisms and how they interact with individual differences in cognitive processing style.

#2 An Investigation on Speaker Specific Articulatory Synthesis with Speaker Independent Articulatory Inversion [PDF]

Authors: Aravind Illa ; Prasanta Kumar Ghosh

Estimating speech representations from articulatory movements is known as articulatory-to-acoustic forward (AAF) mapping. Typically, this mapping is learned from directly measured articulatory movements in a subject-specific manner. Such AAF mapping has been shown to benefit speech synthesis applications. In this work, we investigate the speaker similarity and naturalness of utterances generated by an AAF mapping driven by articulatory movements from a subject (referred to as the cross speaker) different from the speaker (target speaker) used to train the AAF mapping. Experiments are performed with directly measured articulatory data from 9 speakers (8 target speakers and 1 cross speaker), recorded using an electromagnetic articulograph (AG501). Experiments are also performed with articulatory features estimated using a speaker-independent acoustic-to-articulatory inversion (SI-AAI) model trained on 26 reference speakers. Objective evaluation on the target speakers reveals that articulatory features estimated by SI-AAI result in lower Mel-cepstral distortion than directly measured articulatory features. Further, listening tests reveal that directly measured articulatory movements preserve speaker similarity better than estimated ones, although, for naturalness, articulatory movements predicted by SI-AAI perform better than the direct measurements.

#3 Individual Difference of Relative Tongue Size and its Acoustic Effects [PDF]

Authors: Xiaohan Zhang ; Chongke Bi ; Kiyoshi Honda ; Wenhuan Lu ; Jianguo Wei

This study examines how the speaker’s tongue size contributes to generating dynamic characteristics of speaker individuality. The relative tongue size (RTS) has been proposed as an index of the tongue area within the oropharyngeal cavity on midsagittal magnetic resonance imaging (MRI). Our earlier studies have shown that the smaller the RTS, the faster the tongue movement. In this study, the acoustic consequences of individual RTS values were analyzed by comparing tongue movement velocity and formant transition rate. The materials were cine-MRI data and acoustic signals recorded during production of a sentence and two words by two female speakers with contrasting RTS values. The results indicate that the speaker with the smaller RTS value exhibited faster changes of tongue position and formant transitions than the speaker with the larger RTS value. Since tongue size is not controllable by a speaker’s intention, the RTS can be regarded as one of the causal factors of dynamic individual characteristics in the lower frequency region of speech signals.

#4 Individual Differences of Airflow and Sound Generation in the Vocal Tract of Sibilant /s/ [PDF]

Authors: Tsukasa Yoshinaga ; Kazunori Nozaki ; Shigeo Wada

To clarify the individual differences in the flow and sound characteristics of the sibilant /s/, a large eddy simulation of compressible flow was applied to the vocal tract geometries of five subjects pronouncing /s/. Each vocal tract geometry was extracted by separately collecting images of digital dental casts and the vocal tract during /s/. Computational grids were constructed for each geometry, and the flow and acoustic fields were predicted by the simulation. The results showed that the jet flow in the vocal tract was disturbed and fluctuated, and the sound source of /s/ was generated in a different place for each subject. As the jet velocity increased, not only the overall sound amplitude but also the spectral mean increased, indicating that increased jet velocity contributes to larger amplitudes in the higher frequency range across the different vocal tract geometries.

#5 Hush-Hush Speak: Speech Reconstruction Using Silent Videos [PDF]

Authors: Shashwat Uttam ; Yaman Kumar ; Dhruva Sahrawat ; Mansi Aggarwal ; Rajiv Ratn Shah ; Debanjan Mahata ; Amanda Stent

Speech reconstruction is the task of recreating speech from silent videos; in the literature, it is also referred to as lipreading. In this paper, we design an encoder-decoder architecture which takes silent videos as input and outputs an audio spectrogram of the reconstructed speech. Despite being speaker-independent, the model achieves results on speech reconstruction comparable to the current state-of-the-art speaker-dependent model. We also perform user studies to infer speech intelligibility. Additionally, we test the usability of the trained model on bilingual speech.

#6 SPEAK YOUR MIND! Towards Imagined Speech Recognition with Hierarchical Deep Learning [PDF]

Authors: Pramit Saha ; Muhammad Abdul-Mageed ; Sidney Fels

Speech-related Brain Computer Interface (BCI) technologies provide effective vocal communication strategies for controlling devices through speech commands interpreted from brain signals. In order to infer imagined speech from active thoughts, we propose a novel hierarchical deep learning BCI system for subject-independent classification of 11 speech tokens, including phonemes and words. Our approach exploits predicted articulatory information from six phonological categories (e.g., nasal, bilabial) as an intermediate step for classifying the phonemes and words, thereby finding the discriminative signals responsible for natural speech synthesis. The proposed network is composed of a hierarchical combination of spatial and temporal CNNs cascaded with a deep autoencoder. Our best models on the KARA database achieve an average accuracy of 83.42% across the six binary phonological classification tasks and 53.36% for the individual token identification task, significantly outperforming our baselines. Ultimately, our work suggests the possible existence of a brain imagery footprint for the underlying articulatory movements related to different sounds, which can be used to aid imagined speech decoding.

#7 An Unsupervised Autoregressive Model for Speech Representation Learning [PDF]

Authors: Yu-An Chung ; Wei-Ning Hsu ; Hao Tang ; James Glass

This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations. In contrast to other speech representation learning methods that aim to remove noise or speaker variabilities, ours is designed to preserve information for a wide range of downstream tasks. In addition, the proposed model does not require any phonetic or word boundary labels, allowing the model to benefit from large quantities of unlabeled data. Speech representations learned by our model significantly improve performance on both phone classification and speaker verification over the surface features and other supervised and unsupervised approaches. Further analysis shows that different levels of speech information are captured by our model at different layers. In particular, the lower layers tend to be more discriminative for speakers, while the upper layers provide more phonetic content.
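The training objective described above, predicting future frames from past frames without any labels, can be pictured with a minimal sketch. The 80-dimensional log-mel input, the GRU encoder, the L1 loss, and the 3-frame prediction shift below are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of an autoregressive predictive objective for speech frames
# (illustrative assumptions: 80-dim log-mel inputs, a GRU encoder, an L1 loss,
# and a shift of 3 frames into the future; not the authors' exact setup).
import torch
import torch.nn as nn

class AutoregressiveEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=512, shift=3):
        super().__init__()
        self.shift = shift
        self.rnn = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)  # predict a future frame

    def forward(self, mels):                   # mels: (batch, time, n_mels)
        hidden_states, _ = self.rnn(mels)
        prediction = self.proj(hidden_states)
        return prediction, hidden_states       # hidden_states are the learned representations

def unsupervised_loss(model, mels):
    """L1 loss between predicted frames and the frames `shift` steps ahead."""
    prediction, _ = model(mels)
    target = mels[:, model.shift:, :]          # frames shifted into the future
    return torch.nn.functional.l1_loss(prediction[:, :-model.shift, :], target)

if __name__ == "__main__":
    model = AutoregressiveEncoder()
    dummy_batch = torch.randn(4, 200, 80)      # 4 utterances, 200 frames each
    loss = unsupervised_loss(model, dummy_batch)
    loss.backward()
```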

#8 Harmonic-Aligned Frame Mask Based on Non-Stationary Gabor Transform with Application to Content-Dependent Speaker Comparison [PDF]

Authors: Feng Huang ; Peter Balazs

We propose a harmonic-aligned frame mask for speech signals using the non-stationary Gabor transform (NSGT). A frame mask operates on the transform coefficients of a signal and consequently converts the signal into a counterpart signal; it depicts the difference between the two signals. In preceding studies, frame masks based on the regular Gabor transform were applied to single-note instrumental sound analysis. This study extends the frame mask approach to speech signals. For voiced speech, the fundamental frequency usually changes continuously over time. We employ an NSGT with pitch-dependent, and therefore time-varying, frequency resolution to attain harmonic alignment in the transform domain and hence yield harmonic-aligned frame masks for speech signals. We propose to apply the harmonic-aligned frame mask to content-dependent speaker comparison. Frame masks, computed from voiced signals of the same vowel but from different speakers, were utilized as similarity measures to compare and distinguish speaker identities (SID). Results obtained with deep neural networks demonstrate that the proposed frame mask is valid in representing speaker characteristics and shows potential for SID applications in limited-data scenarios.

#9 Glottal Closure Instants Detection from Speech Signal by Deep Features Extracted from Raw Speech and Linear Prediction Residual [PDF]

Authors: Gurunath Reddy M. ; K. Sreenivasa Rao ; Partha Pratim Das

Glottal closure instants (GCIs), also called instants of significant excitation, occur during the abrupt closure of the vocal folds, and their detection is a well-studied problem with many potential applications in speech processing. The speech signal and its linear prediction residual (LPR) are the most popular signal representations for GCI detection. In this paper, we propose a supervised-classification-based GCI detection method in which we train multiple convolutional neural networks to determine a suitable feature representation for efficient GCI detection. We also show that a combined model trained with joint acoustic-residual deep features, together with a model trained on low-pass-filtered speech, significantly increases detection accuracy. We manually annotated the speech signal for ground-truth GCIs using the electroglottograph (EGG) as a reference signal. The evaluation results show that the proposed model, trained with very small and less diverse data, performs significantly better than traditional signal processing and recent data-driven approaches.
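For reference, the linear prediction residual used as one of the input streams can be obtained by inverse-filtering the speech signal with its estimated LPC coefficients. The sketch below uses librosa and SciPy under assumed settings (16 kHz audio, LPC order 16) and is not the authors' pipeline.

```python
# Sketch: compute the linear prediction residual (LPR) of a speech signal
# by inverse filtering with LPC coefficients (assumed order 16 at 16 kHz).
import librosa
from scipy.signal import lfilter

def lp_residual(wav_path, order=16):
    speech, sr = librosa.load(wav_path, sr=16000)
    a = librosa.lpc(speech, order=order)   # a[0] == 1.0, prediction filter A(z)
    residual = lfilter(a, [1.0], speech)   # e[n] = A(z) applied to s[n]
    # The residual emphasizes impulse-like excitation around glottal closures.
    return residual, sr
```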

#10 Learning Problem-Agnostic Speech Representations from Multiple Self-Supervised Tasks [PDF]

Authors: Santiago Pascual ; Mirco Ravanelli ; Joan Serrà ; Antonio Bonafonte ; Yoshua Bengio

Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This paper proposes an improved self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The consensus needed across the different tasks naturally imposes meaningful constraints on the encoder, helping it discover general representations and minimizing the risk of learning superficial ones. Experiments show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues. In addition, a number of design choices make the encoder easily exportable, facilitating its direct usage or adaptation to different problems.
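The encoder-plus-workers arrangement can be sketched as a shared network whose output feeds several small heads, each with its own self-supervised loss that is simply summed into a joint objective. The worker targets and layer sizes below are placeholders, not the actual tasks used in the paper.

```python
# Sketch of a single encoder followed by multiple self-supervised "workers"
# (worker targets and dimensions are placeholders for illustration only).
import torch
import torch.nn as nn

class EncoderWithWorkers(nn.Module):
    def __init__(self, feat_dim=100, target_dims=(1, 40, 25)):
        super().__init__()
        # Shared convolutional encoder over raw waveform chunks.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=8, stride=4), nn.ReLU(),
        )
        # One small regression head ("worker") per self-supervised task.
        self.workers = nn.ModuleList(
            [nn.Conv1d(feat_dim, d, kernel_size=1) for d in target_dims]
        )

    def forward(self, waveform):               # waveform: (batch, 1, samples)
        features = self.encoder(waveform)      # shared representation
        return features, [w(features) for w in self.workers]

def multitask_loss(predictions, targets):
    """Sum of per-worker regression losses; the shared encoder receives all gradients."""
    return sum(torch.nn.functional.mse_loss(p, t) for p, t in zip(predictions, targets))

encoder = EncoderWithWorkers()
feats, worker_outputs = encoder(torch.randn(2, 1, 16000))  # two 1-second chunks at 16 kHz
```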

#11 Excitation Source and Vocal Tract System Based Acoustic Features for Detection of Nasals in Continuous Speech [PDF]

Authors: Bhanu Teja Nellore ; Sri Harsha Dumpala ; Karan Nathwani ; Suryakanth V. Gangashetty

The aim of the current study is to propose acoustic features for detection of nasals in continuous speech. Acoustic features that represent certain characteristics of speech production are extracted. Features representing excitation source characteristics are extracted using zero frequency filtering method. Features representing vocal tract system characteristics are extracted using zero time windowing method. Feature sets are formed by combining certain subsets of the features mentioned above. These feature sets are evaluated for their representativeness of nasals in continuous speech in three different languages, namely, English, Hindi and Telugu. Results show that nasal detection is reliable and consistent across all the languages mentioned above.
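As an illustration of the zero frequency filtering step used for the excitation-source features, the sketch below follows the commonly described recipe (differencing, cascaded zero-frequency resonators, and repeated local-mean subtraction); the window length and the number of trend-removal passes are assumptions, not the authors' exact settings.

```python
# Sketch of zero-frequency filtering (ZFF) for excitation-source features,
# assuming a 16 kHz signal and an average pitch period of ~8 ms for trend removal.
import numpy as np
from scipy.ndimage import uniform_filter1d

def zero_frequency_filter(speech, sr=16000, avg_pitch_ms=8.0):
    x = np.diff(speech, prepend=speech[0])          # pre-emphasis by differencing
    y = x.astype(np.float64)
    for _ in range(2):                              # two cascaded zero-frequency resonators,
        y = np.cumsum(np.cumsum(y))                 # each equivalent to double integration
    win = int(round(avg_pitch_ms * 1e-3 * sr)) | 1  # odd window roughly one pitch period long
    for _ in range(3):                              # repeated local-mean subtraction removes the trend
        y = y - uniform_filter1d(y, size=win)
    return y                                        # zero crossings relate to excitation epochs
```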

#12 Data Augmentation Using GANs for Speech Emotion Recognition [PDF]

Authors: Aggelina Chatziagapi ; Georgios Paraskevopoulos ; Dimitris Sgouropoulos ; Georgios Pantazopoulos ; Malvina Nikandrou ; Theodoros Giannakopoulos ; Athanasios Katsamanis ; Alexandros Potamianos ; Shrikanth Narayanan

In this work, we address the problem of data imbalance for the task of Speech Emotion Recognition (SER). We investigate conditioned data augmentation using Generative Adversarial Networks (GANs), in order to generate samples for underrepresented emotions. We adapt and improve a conditional GAN architecture to generate synthetic spectrograms for the minority class. For comparison purposes, we implement a series of signal-based data augmentation methods. The proposed GAN-based approach is evaluated on two datasets, namely IEMOCAP and FEEL-25k, a large multi-domain dataset. Results demonstrate a 10% relative performance improvement in IEMOCAP and 5% in FEEL-25k, when augmenting the minority classes.
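The conditioning idea, generating spectrograms for a chosen minority emotion by feeding a class label to both generator and discriminator, can be sketched as follows. The layer sizes, spectrogram dimensions, and embedding-concatenation scheme are illustrative assumptions rather than the adapted architecture from the paper.

```python
# Sketch of label-conditioned GAN components for minority-class spectrogram
# augmentation (sizes and the conditioning-by-embedding scheme are assumptions).
import torch
import torch.nn as nn

N_CLASSES, LATENT, SPEC_BINS, SPEC_FRAMES = 4, 100, 128, 128

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(N_CLASSES, LATENT)
        self.net = nn.Sequential(
            nn.Linear(2 * LATENT, 1024), nn.ReLU(),
            nn.Linear(1024, SPEC_BINS * SPEC_FRAMES), nn.Tanh(),
        )

    def forward(self, noise, labels):
        conditioned = torch.cat([noise, self.label_emb(labels)], dim=1)
        return self.net(conditioned).view(-1, 1, SPEC_BINS, SPEC_FRAMES)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(N_CLASSES, SPEC_BINS * SPEC_FRAMES)
        self.net = nn.Sequential(
            nn.Linear(2 * SPEC_BINS * SPEC_FRAMES, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),
        )

    def forward(self, spec, labels):
        flat = spec.view(spec.size(0), -1)
        return self.net(torch.cat([flat, self.label_emb(labels)], dim=1))

# Usage: sample labels of an underrepresented emotion and generate spectrograms.
generator = Generator()
labels = torch.full((8,), 2, dtype=torch.long)   # class index 2 as the hypothetical minority emotion
fake_specs = generator(torch.randn(8, LATENT), labels)
```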

#13 High Quality, Lightweight and Adaptable TTS Using LPCNet [PDF]

Authors: Zvi Kons ; Slava Shechtman ; Alex Sorin ; Carmel Rabinovitz ; Ron Hoory

We present a lightweight, adaptable neural TTS system with high-quality output. The system is composed of three separate neural network blocks: prosody prediction, acoustic feature prediction, and LPCNet (Linear Prediction Coding Net) as a neural vocoder. This system can synthesize speech with close to natural quality while running 3 times faster than real time on a standard CPU. The modular setup of the system allows for simple adaptation to new voices with a small amount of data. We first demonstrate the ability of the system to produce high-quality speech when trained on large, high-quality datasets. Following that, we demonstrate its adaptability by mimicking unseen voices using datasets of 5 to 20 minutes with lower recording quality. Large-scale Mean Opinion Score quality and similarity tests are presented, showing that the system can adapt to unseen voices with a quality gap of 0.12 and a similarity gap of 3% compared to natural speech for male voices, and a quality gap of 0.35 and a similarity gap of 9% for female voices.

#14 Towards Achieving Robust Universal Neural Vocoding [PDF]

Authors: Jaime Lorenzo-Trueba ; Thomas Drugman ; Javier Latorre ; Thomas Merritt ; Bartosz Putrycz ; Roberto Barra-Chicote ; Alexis Moinet ; Vatsal Aggarwal

This paper explores the potential universality of neural vocoders. We train a WaveRNN-based vocoder on 74 speakers coming from 17 languages. This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario, as long as the recording conditions are studio-quality. When the recordings show significant changes in quality, or when moving towards non-speech vocalizations or singing, the vocoder still significantly outperforms speaker-dependent vocoders, but operates at a lower average relative MUSHRA of 75%. These results are shown to be consistent across languages, regardless of whether they were seen during training (e.g. English or Japanese) or unseen (e.g. Wolof, Swahili, Amharic).

#15 Expediting TTS Synthesis with Adversarial Vocoding [PDF]

Authors: Paarth Neekhara ; Chris Donahue ; Miller Puckette ; Shlomo Dubnov ; Julian McAuley

Recent approaches in text-to-speech (TTS) synthesis employ neural network strategies to vocode perceptually-informed spectrogram representations directly into listenable waveforms. Such vocoding procedures create a computational bottleneck in modern TTS pipelines. We propose an alternative approach which utilizes generative adversarial networks (GANs) to learn mappings from perceptually-informed spectrograms to simple magnitude spectrograms which can be heuristically vocoded. Through a user study, we show that our approach significantly outperforms naïve vocoding strategies while being hundreds of times faster than neural network vocoders used in state-of-the-art TTS systems. We also show that our method can be used to achieve state-of-the-art results in unsupervised synthesis of individual words of speech.
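The "heuristically vocoded" step refers to inverting a plain magnitude spectrogram without a neural vocoder; a common choice for this is Griffin-Lim phase estimation, sketched below with assumed STFT parameters.

```python
# Sketch: heuristic vocoding of a linear magnitude spectrogram via Griffin-Lim
# (STFT parameters and iteration count are assumptions for illustration).
import librosa

def heuristic_vocode(magnitude_spectrogram, n_fft=1024, hop_length=256, n_iter=60):
    # magnitude_spectrogram: array of shape (1 + n_fft // 2, frames), linear magnitude
    return librosa.griffinlim(
        magnitude_spectrogram, n_iter=n_iter, hop_length=hop_length, win_length=n_fft
    )
```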

#16 Analysis by Adversarial Synthesis — A Novel Approach for Speech Vocoding [PDF]

Authors: Ahmed Mustafa ; Arijit Biswas ; Christian Bergler ; Julia Schottenhamml ; Andreas Maier

Classical parametric speech coding techniques provide a compact representation of speech signals. This affords a very low transmission rate, but with reduced perceptual quality of the reconstructed signals. Recently, autoregressive deep generative models such as WaveNet and SampleRNN have been used as speech vocoders to scale up the perceptual quality of the reconstructed signals without increasing the coding rate. However, such models suffer from a very slow signal generation mechanism due to their sample-by-sample modelling approach. In this work, we introduce a new methodology for neural speech vocoding based on generative adversarial networks (GANs). A fake speech signal is generated from a very compressed representation of the glottal excitation using conditional GANs as a deep generative model. This fake speech is then refined using the LPC parameters of the original speech signal to obtain a natural reconstruction. The reconstructed speech waveforms based on this approach show higher perceptual quality than their classical vocoder counterparts according to subjective and objective evaluation scores for a dataset of 30 male and female speakers. Moreover, GANs can generate signals in one shot, in contrast to autoregressive generative models, which makes them a promising direction for implementing high-quality neural vocoders.

#17 Quasi-Periodic WaveNet Vocoder: A Pitch Dependent Dilated Convolution Model for Parametric Speech Generation [PDF]

Authors: Yi-Chiao Wu ; Tomoki Hayashi ; Patrick Lumban Tobing ; Kazuhiro Kobayashi ; Tomoki Toda

In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of the WaveNet (WN) vocoder. The effectiveness of the WN vocoder in generating high-fidelity speech samples from given acoustic features has been demonstrated recently. However, because of its fixed dilated convolutions and generic network architecture, the WN vocoder hardly generates speech with given F0 values that lie outside the range observed in the training data. Consequently, the WN vocoder lacks the pitch controllability that is one of the essential capabilities of conventional vocoders. To address this limitation, we propose the PDCNN component, which has a time-variant, adaptive dilation size related to the given F0 values, and a cascade network structure for the QPNet vocoder to generate quasi-periodic signals such as speech. Both objective and subjective tests are conducted, and the experimental results demonstrate the better pitch controllability of the QPNet vocoder compared to same- and double-sized WN vocoders while attaining comparable speech quality.
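One way to picture a pitch-dependent dilation is to scale the dilation by the pitch period in samples, divided by a dense factor. The formula below is a plausible illustration under assumed parameters and may differ from the exact definition used in the paper.

```python
# Sketch: a per-frame, pitch-dependent dilation size derived from F0
# (the scaling by a "dense factor" is an assumption for illustration).
import numpy as np

def pitch_dependent_dilation(f0_hz, sample_rate=22050, dense_factor=4, base_dilation=1):
    """Return an integer dilation per frame that tracks the pitch period in samples."""
    f0_hz = np.maximum(np.asarray(f0_hz, dtype=float), 1e-3)   # guard unvoiced/zero F0
    pitch_period_samples = sample_rate / f0_hz
    scale = np.maximum(np.round(pitch_period_samples / dense_factor), 1).astype(int)
    return base_dilation * scale

# Example: higher F0 gives a shorter pitch period and therefore a smaller dilation.
print(pitch_dependent_dilation([100.0, 200.0, 400.0]))
```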

#18 A Speaker-Dependent WaveNet for Voice Conversion with Non-Parallel Data [PDF]

Authors: Xiaohai Tian ; Eng Siong Chng ; Haizhou Li

In a typical voice conversion system, a vocoder is commonly used for speech-to-features analysis and features-to-speech synthesis. However, the vocoder can be a source of speech quality degradation. This paper presents a novel approach to voice conversion using WaveNet for non-parallel training data. Instead of reconstructing speech from intermediate features, the proposed approach uses WaveNet to map Phonetic PosteriorGrams (PPGs) directly to waveform samples. In this way, we avoid the estimation errors arising from vocoding and feature conversion. Additionally, as PPGs are assumed to be speaker independent, the proposed approach also reduces the feature mismatch problem found in WaveNet vocoder based solutions. Experimental results on the CMU-ARCTIC database show that the proposed approach significantly outperforms the traditional vocoder and WaveNet vocoder baselines in terms of speech quality.

#19 Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition [PDF]

Authors: Ziping Zhao ; Zhongtian Bao ; Zixing Zhang ; Nicholas Cummins ; Haishuai Wang ; Björn W. Schuller

Discrete speech emotion recognition (SER), the assignment of a single emotion label to an entire speech utterance, is typically performed as a sequence-to-label task. This approach, however, is limited in that it can result in models that do not capture temporal changes in the speech signal, including those indicative of a particular emotion. One potential solution to overcome this limitation is to model SER as a sequence-to-sequence task instead. In this regard, we have developed an attention-based bidirectional long short-term memory (BLSTM) neural network in combination with a connectionist temporal classification (CTC) objective function (Attention-BLSTM-CTC) for SER. We also assessed the benefits of incorporating two contemporary attention mechanisms, namely component attention and quantum attention, into the CTC framework. To the best of the authors’ knowledge, this is the first time that such a hybrid architecture has been employed for SER. We demonstrate the effectiveness of our approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and FAU-Aibo Emotion corpora. The experimental results demonstrate that our proposed model outperforms current state-of-the-art approaches.

#20 Attentive to Individual: A Multimodal Emotion Recognition Network with Personalized Attention Profile [PDF]

Authors: Jeng-Lin Li ; Chi-Chun Lee

A growing number of human-centered applications benefit from continuous advancements in emotion recognition technology. Many emotion recognition algorithms have been designed to model multimodal behavior cues to achieve high performance. However, most of them do not consider the modulating effect of an individual’s personal attributes on his/her expressive behaviors. In this work, we propose a Personalized Attributes-Aware Attention Network (PAaAN) with a novel personalized attention mechanism to perform emotion recognition using speech and language cues. The attention profile is learned from embeddings of an individual’s profile, acoustic, and lexical behavior data. The profile embedding is derived using Linguistic Inquiry and Word Count (LIWC) features computed between the target speaker and a large set of movie scripts. Our method achieves a state-of-the-art 70.3% unweighted accuracy in a four-class emotion recognition task on IEMOCAP. Further analysis reveals that affect-related semantic categories are emphasized differently for each speaker in the corpus, showing the effectiveness of our attention mechanism for personalization.

#21 A Saliency-Based Attention LSTM Model for Cognitive Load Classification from Speech [PDF]

Authors: Ascensión Gallardo-Antolín ; Juan Manuel Montero

Cognitive Load (CL) refers to the amount of mental demand that a given task imposes on an individual’s cognitive system, and it can affect his/her productivity in very high load situations. In this paper, we propose an automatic system capable of classifying the CL level of a speaker by analyzing his/her voice. Our research on this topic goes in two main directions. In the first, we focus on the use of Long Short-Term Memory (LSTM) networks with different weighted pooling strategies for CL level classification. In the second contribution, to overcome the need for a large amount of training data, we propose a novel attention mechanism that uses Kalinli’s auditory saliency model. Experiments show that our proposal significantly outperforms both a baseline system based on Support Vector Machines (SVMs) and an LSTM-based system with a logistic regression attention model.
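The weighted-pooling idea can be illustrated with a generic learned-attention layer over bidirectional LSTM outputs; the saliency-derived scores from Kalinli's auditory model are not reproduced here, so this is only a stand-in sketch.

```python
# Sketch of attention-based weighted pooling over LSTM outputs for
# utterance-level classification (a generic learned-attention stand-in;
# the paper's saliency-derived scores are not reproduced here).
import torch
import torch.nn as nn

class AttentionPoolingLSTM(nn.Module):
    def __init__(self, n_features=40, hidden=128, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)        # one attention score per frame
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, frames):                       # frames: (batch, time, n_features)
        outputs, _ = self.lstm(frames)               # (batch, time, 2*hidden)
        weights = torch.softmax(self.score(outputs), dim=1)   # (batch, time, 1)
        pooled = (weights * outputs).sum(dim=1)      # weighted pooling over time
        return self.classifier(pooled)

model = AttentionPoolingLSTM()
logits = model(torch.randn(2, 300, 40))              # two utterances, 300 frames each
```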

#22 A Hierarchical Attention Network-Based Approach for Depression Detection from Transcribed Clinical Interviews [PDF]

Authors: Adria Mallol-Ragolta ; Ziping Zhao ; Lukas Stappen ; Nicholas Cummins ; Björn W. Schuller

The high prevalence of depression in society has given rise to a need for new digital tools that can aid its early detection. Among other effects, depression impacts the use of language. Seeking to exploit this, this work focuses on the detection of depressed and non-depressed individuals through the analysis of linguistic information extracted from transcripts of clinical interviews with a virtual agent. Specifically, we investigated the advantages of employing hierarchical attention-based networks for this task. Using Global Vectors (GloVe) pretrained word embedding models to extract low-level representations of the words, we compared hierarchical local-global attention networks and hierarchical contextual attention networks. We performed our experiments on the Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WoZ) dataset, which contains audio, visual, and linguistic information acquired from participants during a clinical session. Our results using the DAIC-WoZ test set indicate that hierarchical contextual attention networks are the most suitable configuration to detect depression from transcripts. The configuration achieves an Unweighted Average Recall (UAR) of .66 using the test set, surpassing our baseline, a Recurrent Neural Network that does not use attention.

#23 Listeners’ Ability to Identify the Gender of Preadolescent Children in Different Linguistic Contexts [PDF]

Authors: Shawn Nissen ; Sharalee Blunck ; Anita Dromey ; Christopher Dromey

This study evaluated listeners’ ability to identify the gender of preadolescent children from speech samples of varying length and linguistic context. The listeners were presented with a total of 190 speech samples in four different categories of linguistic context: segments, words, sentences, and discourse. The listeners were instructed to evaluate each speech sample and decide whether the speaker was a male or female and rate their level of confidence in their decision. Results showed listeners identified the gender of the speakers with a high degree of accuracy, ranging from 86% to 95%. Significant differences in listener judgments were found across the four levels of linguistic context, with segments having the lowest accuracy (83%) and discourse the highest accuracy (99%). At the segmental level, the listeners’ identification of each speaker’s gender was greater for vowels than for fricatives, with both types of phoneme being identified at a rate well above chance. Significant differences in identification were found between the /s/ and /ʃ/ fricatives, but not between the four corner vowels. The perception of gender is likely multifactorial, with listeners possibly using phonetic, prosodic, or stylistic speech cues to determine a speaker’s gender.

#24 Sibilant Variation in New Englishes: A Comparative Sociophonetic Study of Trinidadian and American English /s(tr)/-Retraction [PDF]

Authors: Wiebke Ahlers ; Philipp Meer

The retraction of /s/, particularly in /str/ clusters, toward [ʃ] has been investigated in British, Australian, and American English and shown to be conditioned phonetically and sociolinguistically. To date, however, no research exists on the retraction of /s/ in New Englishes, the nativized Englishes spoken in postcolonial territories such as the Caribbean. We address this research gap and present the results of a large-scale comparative acoustic analysis of /s/-retraction in Trinidadian English (TrinE) and American English (AmE), using Center of Gravity measurements of more than 23,500 sibilants produced by 181 speakers from two speech corpora. The results show that, in TrinE, /str/ is considerably retracted toward [ʃtɹ], while all other /sC(r)/ clusters are non-retracted and acoustically close to singleton /s/; less retracted realizations of /str/ occur across word boundaries. Although a statistically significant contrast is overall maintained between /ʃ/ and the sibilant in /str/, there is considerable overlap for many speakers. The comparison between TrinE and AmE indicates that, while sibilants in TrinE show acoustically lower values overall, both varieties have in common that retraction is limited to /str/ contexts and is significantly larger in younger speakers. The degree of /str/-retraction, however, is overall larger in TrinE than in AmE.
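The Center of Gravity measure is the spectral mean of a sibilant's spectrum; a minimal computation is sketched below, assuming power-spectrum weighting, which may differ from the exact weighting used in the study.

```python
# Sketch: spectral center of gravity (CoG) of a sibilant segment
# (power-spectrum weighting assumed; studies differ on the exact weighting).
import numpy as np

def center_of_gravity(segment, sample_rate):
    spectrum = np.abs(np.fft.rfft(segment * np.hanning(len(segment)))) ** 2
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sample_rate)
    return np.sum(freqs * spectrum) / np.sum(spectrum)   # in Hz; lower values indicate retraction toward [ʃ]
```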

#25 Tracking the New Zealand English NEAR/SQUARE Merger Using Functional Principal Components Analysis [PDF]

Authors: Michele Gubian ; Jonathan Harrington ; Mary Stevens ; Florian Schiel ; Paul Warren

The focus of this study is the application of functional principal components analysis (FPCA) to a sound change in progress in which the SQUARE and NEAR falling diphthongs are merging in New Zealand English. FPCA approximated the trajectory shapes of the first two formant frequencies (F1/F2) in a large acoustic database of read New Zealand English speech spanning three age groups and two regions. The derived FPCA parameters showed a greater degree of centralisation and monophthongisation in SQUARE than in NEAR. Consistent with evidence of an ongoing sound change in which SQUARE is shifting towards NEAR, these shape differences were more marked for older than for younger and mid-age speakers. There was no effect of region or of place of articulation of the preceding consonant; there was a trend for the merger to be more advanced in low-frequency words. The study underlines the benefits of FPCA for quantifying the many types of sound change that involve subtle shifts in speech dynamics. In particular, multi-dimensional trajectory shape differences can be quantified without the need for vowel targets or for determining the influence of the parameters (in this case, the first two formant frequencies) independently of each other.
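As a rough illustration of how FPCA parameterises whole trajectories, the sketch below approximates it by ordinary PCA on formant trajectories resampled to a common time grid; a dedicated FPCA implementation with smoothing basis functions, as used in the study, would differ in detail.

```python
# Rough sketch: approximate FPCA of F1/F2 trajectories by resampling each
# trajectory to a fixed grid and applying ordinary PCA (a discretised stand-in
# for true basis-function FPCA as used in the study).
import numpy as np
from sklearn.decomposition import PCA

def fpca_scores(trajectories, n_points=20, n_components=3):
    """trajectories: list of (length_i, 2) arrays holding [F1, F2] over time."""
    grid = np.linspace(0.0, 1.0, n_points)
    resampled = []
    for traj in trajectories:
        t = np.linspace(0.0, 1.0, len(traj))
        # Interpolate F1 and F2 onto the common grid and concatenate them.
        resampled.append(np.concatenate([np.interp(grid, t, traj[:, k]) for k in range(2)]))
    X = np.vstack(resampled)                      # one row per vowel token
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(X)                 # low-dimensional trajectory-shape parameters
    return scores, pca
```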