INTERSPEECH.2020 - Others

Total: 408

#1 Towards Learning a Universal Non-Semantic Representation of Speech

Authors: Joel Shor ; Aren Jansen ; Ronnie Maor ; Oran Lang ; Omry Tuval ; Félix de Chaumont Quitry ; Marco Tagliasacchi ; Ira Shavitt ; Dotan Emanuel ; Yinnon Haviv

The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a pre-existing embedding model trained on different datasets or tasks. The visual and language communities have established benchmarks to compare embeddings, but the speech community has yet to do so. This paper proposes a benchmark for comparing speech representations on non-semantic tasks, and proposes a representation based on an unsupervised triplet-loss objective. The proposed representation outperforms other representations on the benchmark, and even exceeds state-of-the-art performance on a number of transfer learning tasks. The embedding is trained on a publicly available dataset, and it is tested on a variety of low-resource downstream tasks, including personalization tasks and the medical domain. The benchmark, models, and evaluation code are publicly released.
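
As a rough illustration of the kind of triplet-loss objective mentioned in the abstract (the authors' exact sampling strategy and architecture are not reproduced here; all names and dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss on embedding vectors.

    anchor/positive are assumed to be embeddings of nearby segments of the
    same clip; negative comes from a different clip.
    """
    d_pos = F.pairwise_distance(anchor, positive)  # distance to "same" segment
    d_neg = F.pairwise_distance(anchor, negative)  # distance to "different" segment
    return F.relu(d_pos - d_neg + margin).mean()

# toy usage with random 128-dim embeddings for a batch of 8 triplets
a, p, n = (torch.randn(8, 128) for _ in range(3))
print(triplet_loss(a, p, n))
```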

#2 Poetic Meter Classification Using i-Vector-MTF Fusion

Authors: Rajeev Rajan ; Aiswarya Vinod Kumar ; Ben P. Babu

In this paper, a deep neural network (DNN)-based poetic meter classification scheme is proposed using a fusion of musical texture features (MTF) and i-vectors. The experiment is performed in two phases. Initially, the mel-frequency cepstral coefficient (MFCC) features are fused with MTF and classification is done using a DNN. MTF include timbral, rhythmic, and melodic features. In the second phase, the MTF are fused with i-vectors and classification is performed. The performance is evaluated using a newly created poetic corpus in Malayalam, one of the prominent languages in India. While the MFCC-MTF/DNN system reports an overall accuracy of 80.83%, the i-vector/MTF fusion reports an overall accuracy of 86.66%. The performance is also compared with a baseline support vector machine (SVM)-based classifier. The results show that the architectural choice of i-vector fusion with MTF on a DNN has merit in recognizing meters from recited poems.
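
A minimal sketch of the feature-level fusion described above (feature dimensions, layer sizes, and the number of meter classes are assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

class FusionDNN(nn.Module):
    """Classify a recited poem's meter from concatenated i-vector and MTF features."""
    def __init__(self, ivector_dim=400, mtf_dim=60, n_meters=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ivector_dim + mtf_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_meters),              # meter class logits
        )

    def forward(self, ivec, mtf):
        return self.net(torch.cat([ivec, mtf], dim=-1))  # early (feature-level) fusion

model = FusionDNN()
logits = model(torch.randn(4, 400), torch.randn(4, 60))  # batch of 4 utterances
print(logits.shape)  # torch.Size([4, 6])
```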

#3 Formant Tracking Using Dilated Convolutional Networks Through Dense Connection with Gating Mechanism

Authors: Wang Dai ; Jinsong Zhang ; Yingming Gao ; Wei Wei ; Dengfeng Ke ; Binghuai Lin ; Yanlu Xie

Formant tracking is one of the most fundamental problems in speech processing. Traditionally, formants are estimated using signal processing methods. Recent studies showed that generic convolutional architectures can outperform recurrent networks on temporal tasks such as speech synthesis and machine translation. In this paper, we explored the use of Temporal Convolutional Network (TCN) for formant tracking. In addition to the conventional implementation, we modified the architecture in three aspects. First, we turned off the “causal” mode of dilated convolution, allowing the dilated convolution to see future speech frames. Second, each hidden layer reused the output information from all the previous layers through dense connection. Third, we adopted a gating mechanism to alleviate the vanishing-gradient problem by selectively forgetting unimportant information. The model was validated on the open access formant database VTR. The experiment showed that our proposed model converged easily and achieved an overall mean absolute percent error (MAPE) of 8.2% on speech-labeled frames, compared to three competitive baselines of 9.4% (LSTM), 9.1% (Bi-LSTM) and 8.9% (TCN).
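
A rough sketch of the three modifications named above (non-causal dilated convolutions, dense connections between layers, and a GLU-style gate), assuming generic PyTorch layers; this is not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class GatedDilatedBlock(nn.Module):
    """Non-causal dilated conv with gating; symmetric 'same' padding lets the
    layer see future frames (the non-causal mode mentioned in the abstract)."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        pad = dilation  # kernel_size=3 with symmetric padding -> non-causal
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel_size=3,
                              padding=pad, dilation=dilation)

    def forward(self, x):
        h, g = self.conv(x).chunk(2, dim=1)
        return h * torch.sigmoid(g)  # gate selectively passes information

class DenseTCN(nn.Module):
    """Each block receives the concatenation of all previous block outputs."""
    def __init__(self, in_ch=40, hidden=64, n_formants=4, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.blocks = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.blocks.append(GatedDilatedBlock(ch, hidden, d))
            ch += hidden  # dense connection grows the input channels
        self.head = nn.Conv1d(ch, n_formants, kernel_size=1)  # per-frame formant values

    def forward(self, x):                    # x: (batch, feat, frames)
        feats = [x]
        for block in self.blocks:
            feats.append(block(torch.cat(feats, dim=1)))
        return self.head(torch.cat(feats, dim=1))

y = DenseTCN()(torch.randn(2, 40, 100))
print(y.shape)  # torch.Size([2, 4, 100])
```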

#4 Automatic Analysis of Speech Prosody in Dutch

Authors: Na Hu ; Berit Janssen ; Judith Hanssen ; Carlos Gussenhoven ; Aoju Chen

In this paper we present a publicly available tool for automatic analysis of speech prosody (AASP) in Dutch. Incorporating state-of-the-art analytical frameworks, AASP enables users to analyze prosody at two levels from different theoretical perspectives. Holistically, by means of Functional Principal Component Analysis (FPCA), it generates mathematical functions that capture changes in the shape of a pitch contour. The tool outputs the weights of principal components in a table for users to process in further statistical analysis. Structurally, AASP analyzes prosody in terms of prosodic events within the autosegmental-metrical framework, hypothesizing prosodic labels in accordance with Transcription of Dutch Intonation (ToDI) with accuracy comparable to similar tools for other languages. Published as a Docker container, the tool can be set up on various operating systems in only two steps. Moreover, the tool is accessed through a graphical user interface, making it accessible to users with limited programming skills.

#5 Learning Voice Representation Using Knowledge Distillation for Automatic Voice Casting

Authors: Adrien Gresse ; Mathias Quillot ; Richard Dufour ; Jean-François Bonastre

The search for professional voice-actors for audiovisual productions is a sensitive task, performed by the artistic directors (ADs). The ADs have a strong appetite for new talents/voices but cannot perform large-scale auditions. Automatic tools able to suggest the most suitable voices are of great interest to the audiovisual industry. In previous work, we showed the existence of acoustic information that makes it possible to mimic the ADs’ choices. However, the only available information is the ADs’ choices from already dubbed multimedia productions. In this paper, we propose a representation-learning based strategy to build a character/role representation, called p-vector. In addition, the large variability between audiovisual productions makes it difficult to have homogeneous training datasets. We overcome this difficulty by using knowledge distillation methods to take advantage of external datasets. Experiments are conducted on video-game voice excerpts. Results show a significant improvement using the p-vector, compared to the speaker-based x-vector representation.

#6 Enhancing Formant Information in Spectrographic Display of Speech

Authors: B. Yegnanarayana ; Anand Joseph ; Vishala Pannala

Formants are resonances of the time-varying vocal tract system, and their characteristics are reflected in the response of the system to the sequence of impulse-like excitations originating at the glottis. This paper presents a method to enhance the formant information in the display of the spectrogram of the speech signal, especially for high-pitched voices. It is well known that in the narrowband spectrogram, the presence of pitch harmonics masks the formant information, whereas in the wideband spectrogram, the formant regions are smeared. Using single frequency filtering (SFF) analysis, we show that the wideband-equivalent SFF spectrogram can be modified to enhance the formant information in the display by improving the frequency resolution. For this, we obtain two SFF spectrograms by using single frequency filtering of the speech signal at two closely spaced roots on the real axis in the z-plane. The ratio or difference of the two SFF spectrograms is processed to enhance the formant information in the spectrographic display. This will help in tracking rapidly changing formants and in resolving closely spaced formants. The effect is more pronounced in the case of high-pitched voices, like female and children speech.
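
The core computation (single-pole filtering of a frequency-shifted signal at two closely spaced real-axis roots, then a ratio of the resulting envelopes) can be sketched roughly as follows; the exact filter form and root values are assumptions, not the authors' settings:

```python
import numpy as np
from scipy.signal import lfilter

def sff_spectrogram(x, fs, freqs, r):
    """Single-frequency-filtering envelope |y_k[n]| for each frequency f_k.

    Each frequency is shifted to fs/2 and filtered with a single real pole
    at z = -r (a root on the real axis, as in the abstract).
    """
    n = np.arange(len(x))
    spec = np.empty((len(freqs), len(x)))
    for k, f in enumerate(freqs):
        shifted = x * np.exp(1j * np.pi * n * (1 - 2 * f / fs))  # move f_k to fs/2
        y = lfilter([1.0], [1.0, r], shifted)                    # y[n] = -r*y[n-1] + x~[n]
        spec[k] = np.abs(y)
    return spec

fs = 16000
x = np.random.randn(fs)                       # stand-in for one second of speech
freqs = np.arange(0, 5000, 20)
s1 = sff_spectrogram(x, fs, freqs, r=0.995)   # two closely spaced roots
s2 = sff_spectrogram(x, fs, freqs, r=0.99)
enhanced = s1 / (s2 + 1e-12)                  # ratio spectrogram emphasizes formants
```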

#7 Unsupervised Methods for Evaluating Speech Representations

Authors: Michael Gump ; Wei-Ning Hsu ; James Glass

Disentanglement is a desired property in representation learning, and a significant body of research has tried to show that it is a useful representational prior. Evaluating disentanglement is challenging, particularly for real-world data like speech, where ground truth generative factors are typically not available. Previous work on disentangled representation learning in speech has used categorical supervision like phoneme or speaker identity in order to disentangle grouped feature spaces. However, that line of work differs from the typical dimension-wise view of disentanglement in other domains. This paper proposes to use low-level acoustic features to provide the structure required to evaluate dimension-wise disentanglement. By choosing well-studied acoustic features, grounded and descriptive evaluation is made possible for unsupervised representation learning. This work produces a toolkit for evaluating disentanglement in unsupervised representations of speech and evaluates its efficacy on previous research.

#8 Robust Pitch Regression with Voiced/Unvoiced Classification in Nonstationary Noise Environments

Authors: Dung N. Tran ; Uros Batricevic ; Kazuhito Koishida

Accurate voiced/unvoiced information is crucial in estimating the pitch of a target speech signal in severe nonstationary noise environments. Nevertheless, state-of-the-art pitch estimators based on deep neural networks (DNNs) lack a dedicated mechanism for robustly detecting voiced and unvoiced segments in the target speech in noisy conditions. In this work, we proposed an end-to-end deep learning-based pitch estimation framework which jointly detects voiced/unvoiced segments and predicts pitch values for the voiced regions of the ground-truth speech. We empirically showed that our proposed framework is significantly more robust than state-of-the-art DNN-based pitch detectors in nonstationary noise settings. Our results suggest that joint training of voiced/unvoiced detection and voiced pitch prediction can significantly improve pitch estimation performance.
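
A minimal sketch of the joint objective described above: one voiced/unvoiced classification head plus a pitch-regression head whose loss is applied only to voiced frames. The network body, loss weighting, and pitch target scale are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointPitchNet(nn.Module):
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.vuv_head = nn.Linear(2 * hidden, 1)    # voiced/unvoiced logit per frame
        self.pitch_head = nn.Linear(2 * hidden, 1)  # pitch value per frame

    def forward(self, feats):                       # feats: (batch, frames, feat_dim)
        h, _ = self.rnn(feats)
        return self.vuv_head(h).squeeze(-1), self.pitch_head(h).squeeze(-1)

def joint_loss(vuv_logit, pitch_pred, vuv_true, pitch_true, alpha=1.0):
    vuv_loss = F.binary_cross_entropy_with_logits(vuv_logit, vuv_true)
    voiced = vuv_true > 0.5
    pitch_loss = F.l1_loss(pitch_pred[voiced], pitch_true[voiced])  # voiced frames only
    return vuv_loss + alpha * pitch_loss

feats = torch.randn(2, 50, 80)
vuv_true = (torch.rand(2, 50) > 0.3).float()
pitch_true = torch.rand(2, 50) * 300 + 50            # toy pitch targets in Hz
vuv_logit, pitch_pred = JointPitchNet()(feats)
print(joint_loss(vuv_logit, pitch_pred, vuv_true, pitch_true))
```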

#9 Nonlinear ISA with Auxiliary Variables for Learning Speech Representations

Authors: Amrith Setlur ; Barnabás Póczos ; Alan W. Black

This paper extends recent work on nonlinear Independent Component Analysis (ICA) by introducing a theoretical framework for nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables. Observed high dimensional acoustic features like log Mel spectrograms can be considered as surface-level manifestations of nonlinear transformations over individual multivariate sources of information like speaker characteristics, phonological content, etc. Under the assumptions of energy-based models, we use the theory of nonlinear ISA to propose an algorithm that learns unsupervised speech representations whose subspaces are independent and potentially highly correlated with the original non-stationary multivariate sources. We show how nonlinear ICA with auxiliary variables can be extended to a generic identifiable model for subspaces as well, while also providing sufficient conditions for the identifiability of these high dimensional subspaces. Our proposed methodology is generic and can be integrated with standard unsupervised approaches to learn speech representations with subspaces that can theoretically capture independent higher order speech signals. We evaluate the gains of our algorithm when integrated with the Autoregressive Predictive Coding (APC) model by showing empirical results on the speaker verification and phoneme recognition tasks.

#10 Harmonic Lowering for Accelerating Harmonic Convolution for Audio Signals

Authors: Hirotoshi Takeuchi ; Kunio Kashino ; Yasunori Ohishi ; Hiroshi Saruwatari

Convolutional neural networks have been successfully applied to a variety of audio signal processing tasks including sound source separation, speech recognition and acoustic scene understanding. Since many pitched sounds have a harmonic structure, an operation called harmonic convolution has been proposed to take advantage of the structure appearing in audio signals. However, the computational cost involved is higher than that of normal convolution. This paper proposes a faster calculation method of harmonic convolution called Harmonic Lowering. The method unrolls the input data to a redundant layout so that the normal convolution operation can be applied. An analysis of runtimes and the number of multiplication operations shows that the proposed method accelerates harmonic convolution by a factor of 2 to 7 over the conventional method under realistic parameter settings, while introducing no approximation.
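
The key idea, unrolling the input along the frequency axis at harmonic multiples of each bin so that an ordinary convolution can realize the harmonic convolution, can be sketched as follows; the anchor definition and any interpolation used in the paper may differ:

```python
import torch

def harmonic_lowering(spec, n_harmonics=4):
    """Unroll a (batch, 1, freq, time) spectrogram so that channel n holds the
    value at n times each frequency bin; out-of-range bins are left at zero."""
    b, _, n_freq, n_time = spec.shape
    out = torch.zeros(b, n_harmonics, n_freq, n_time, dtype=spec.dtype)
    for n in range(1, n_harmonics + 1):
        idx = torch.arange(n_freq) * n            # harmonic positions n*f
        valid = idx < n_freq
        out[:, n - 1, valid] = spec[:, 0, idx[valid]]
    return out

spec = torch.randn(2, 1, 128, 100)                # log-magnitude spectrogram
lowered = harmonic_lowering(spec)                 # (2, 4, 128, 100)
# a normal 2-D convolution over the lowered tensor now mixes harmonics
conv = torch.nn.Conv2d(4, 16, kernel_size=(1, 3), padding=(0, 1))
print(conv(lowered).shape)                        # torch.Size([2, 16, 128, 100])
```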

#11 End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

Authors: Shota Horiguchi ; Yusuke Fujita ; Shinji Watanabe ; Yawen Xue ; Kenji Nagamatsu

End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. Recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence. Then, the generated attractors are multiplied by the speech embedding sequence to produce the same number of speaker activities. The speech embedding sequence is extracted using the conventional self-attentive end-to-end neural speaker diarization (SA-EEND) network. In a two-speaker condition, our method achieved a 2.69% diarization error rate (DER) on simulated mixtures and an 8.07% DER on the two-speaker subset of CALLHOME, while vanilla SA-EEND attained 4.56% and 9.54%, respectively. In conditions with an unknown number of speakers, our method attained a 15.29% DER on CALLHOME, while the x-vector-based clustering method achieved a 19.43% DER.
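
A bare-bones sketch of the attractor step described above: an LSTM encoder-decoder emits attractors from the embedding sequence, and speaker activities are the sigmoid of dot products between frame embeddings and attractors. Layer sizes and the stopping criterion for the number of attractors are simplified assumptions:

```python
import torch
import torch.nn as nn

class EncoderDecoderAttractor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.exist = nn.Linear(dim, 1)   # probability that an attractor is "real"

    def forward(self, emb, max_speakers=4):          # emb: (batch, frames, dim)
        _, state = self.encoder(emb)                 # summarize the embedding sequence
        zeros = torch.zeros(emb.size(0), max_speakers, emb.size(2))
        attractors, _ = self.decoder(zeros, state)   # one attractor per decode step
        exist_prob = torch.sigmoid(self.exist(attractors)).squeeze(-1)
        activities = torch.sigmoid(emb @ attractors.transpose(1, 2))  # (batch, frames, S)
        return activities, exist_prob                # threshold exist_prob to pick #speakers

emb = torch.randn(1, 500, 256)                       # frame embeddings from SA-EEND
act, prob = EncoderDecoderAttractor()(emb)
print(act.shape, prob.shape)                         # (1, 500, 4) (1, 4)
```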

#12 Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

Authors: Ivan Medennikov ; Maxim Korenevsky ; Tatiana Prisyach ; Yuri Khokhlov ; Mariya Korenevskaya ; Ivan Sorokin ; Tatiana Timofeeva ; Anton Mitrofanov ; Andrei Andrusenko ; Ivan Podluzhny ; Aleksandr Laptev ; Aleksei Romanenko

Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker at each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces activities of each speaker. I-vectors can be estimated iteratively, starting from a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% absolute in Diarization Error Rate (DER).
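
A simplified sketch of the single-channel TS-VAD idea: frame features concatenated with each target speaker's i-vector, and one binary activity output per speaker per frame. Dimensions and the way the per-speaker branches are shared are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TSVAD(nn.Module):
    def __init__(self, feat_dim=40, ivec_dim=100, hidden=128, n_speakers=4):
        super().__init__()
        self.n_speakers = n_speakers
        # shared per-speaker branch: frame features + that speaker's i-vector
        self.speaker_rnn = nn.GRU(feat_dim + ivec_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)          # binary activity logit

    def forward(self, feats, ivectors):          # feats: (B, T, F); ivectors: (B, S, I)
        logits = []
        for s in range(self.n_speakers):
            ivec = ivectors[:, s : s + 1].expand(-1, feats.size(1), -1)
            h, _ = self.speaker_rnn(torch.cat([feats, ivec], dim=-1))
            logits.append(self.out(h))           # (B, T, 1)
        return torch.sigmoid(torch.cat(logits, dim=-1))   # (B, T, S) activities

probs = TSVAD()(torch.randn(2, 300, 40), torch.randn(2, 4, 100))
print(probs.shape)   # torch.Size([2, 300, 4])
```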

#13 New Advances in Speaker Diarization

Authors: Hagai Aronowitz ; Weizhong Zhu ; Masayuki Suzuki ; Gakuto Kurata ; Ron Hoory

Recently, speaker diarization based on speaker embeddings has shown excellent results in many works. In this paper we propose several enhancements throughout the diarization pipeline. This work addresses two clustering frameworks: agglomerative hierarchical clustering (AHC) and spectral clustering (SC). First, we use multiple speaker embeddings. We show that fusion of x-vectors and d-vectors boosts accuracy significantly. Second, we train neural networks to leverage both acoustic and duration information for scoring similarity of segments or clusters. Third, we introduce a novel method to guide the AHC clustering mechanism using a neural network. Fourth, we handle short duration segments in SC by deemphasizing their effect on setting the number of speakers. Finally, we propose a novel method for estimating the number of clusters in the SC framework. The method takes each eigenvalue and analyzes the projections of the SC similarity matrix on the corresponding eigenvector. We evaluated our system on NIST SRE 2000 CALLHOME and, using cross-validation, achieved an error rate of 5.1%, surpassing the state of the art in speaker diarization.

#14 Self-Attentive Similarity Measurement Strategies in Speaker Diarization

Authors: Qingjian Lin ; Yu Hou ; Ming Li

Speaker diarization can be described as the process of extracting sequential speaker embeddings from an audio stream and clustering them according to speaker identities. Nowadays, deep neural network-based approaches such as the x-vector have been widely adopted for speaker embedding extraction. However, in the clustering back-end, probabilistic linear discriminant analysis (PLDA) is still the dominant algorithm for similarity measurement. PLDA works in a pair-wise and independent manner, which may ignore the positional correlation of adjacent speaker embeddings. To address this issue, our previous work proposed a long short-term memory (LSTM) based scoring model, followed by the spectral clustering algorithm. In this paper, we further propose two enhanced methods based on the self-attention mechanism, which no longer focuses on local correlation but searches for similar speaker embeddings in the whole sequence. The first approach achieves state-of-the-art performance on the DIHARD II Eval Set (18.44% DER after resegmentation), while the second one operates with higher efficiency.
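
One way to read the self-attentive scoring idea is a self-attention encoder over the whole embedding sequence followed by a pairwise similarity matrix. The sketch below uses a standard Transformer encoder layer and cosine scoring as stand-ins for the paper's exact scoring model, with illustrative dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveScorer(nn.Module):
    """Refine speaker embeddings with self-attention, then score all pairs."""
    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, emb):                      # emb: (batch, segments, dim)
        h = F.normalize(self.encoder(emb), dim=-1)
        return h @ h.transpose(1, 2)             # (batch, segments, segments) similarity

emb = torch.randn(1, 200, 256)                   # x-vectors of consecutive segments
sim = SelfAttentiveScorer()(emb)                 # similarity matrix for clustering
print(sim.shape)                                 # torch.Size([1, 200, 200])
```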

#15 Speaker Attribution with Voice Profiles by Graph-Based Semi-Supervised Learning

Authors: Jixuan Wang ; Xiong Xiao ; Jian Wu ; Ranjani Ramamurthy ; Frank Rudzicz ; Michael Brudno

Speaker attribution is required in many real-world applications, such as meeting transcription, where speaker identity is assigned to each utterance according to speaker voice profiles. In this paper, we propose to solve the speaker attribution problem by using graph-based semi-supervised learning methods. A graph of speech segments is built for each session, on which segments from voice profiles are represented by labeled nodes while segments from test utterances are unlabeled nodes. The weights of the edges between nodes are given by the similarities between the pretrained speaker embeddings of speech segments. Speaker attribution then becomes a semi-supervised learning problem on graphs, to which two graph-based methods are applied: label propagation (LP) and graph neural networks (GNNs). The proposed approaches are able to utilize the structural information of the graph to improve speaker attribution performance. Experimental results on real meeting data show that the graph-based approaches reduce speaker attribution error by up to 68% compared to a baseline speaker identification approach that processes each utterance independently.
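
A compact sketch of the label-propagation variant on a segment graph, with edge weights from cosine similarity of speaker embeddings and profile segments acting as labeled nodes; the normalization and constants are illustrative, not taken from the paper:

```python
import numpy as np

def label_propagation(emb, labels, n_speakers, alpha=0.9, n_iter=50):
    """emb: (N, D) segment embeddings; labels[i] = speaker id for profile
    segments, -1 for unlabeled test segments."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    W = np.clip(emb @ emb.T, 0, None)            # cosine affinity, negatives dropped
    np.fill_diagonal(W, 0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(1) + 1e-12))
    S = D_inv_sqrt @ W @ D_inv_sqrt              # symmetric normalization

    Y = np.zeros((len(labels), n_speakers))
    Y[labels >= 0, labels[labels >= 0]] = 1.0    # one-hot for profile nodes
    F = Y.copy()
    for _ in range(n_iter):                      # F <- alpha*S*F + (1-alpha)*Y
        F = alpha * S @ F + (1 - alpha) * Y
    return F.argmax(1)                           # speaker assignment per segment

emb = np.random.randn(20, 128)
labels = np.array([0, 1, 2] + [-1] * 17)         # 3 voice-profile segments, 17 test
print(label_propagation(emb, labels, n_speakers=3))
```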

#16 Deep Self-Supervised Hierarchical Clustering for Speaker Diarization

Authors: Prachi Singh ; Sriram Ganapathy

The state-of-the-art speaker diarization systems use agglomerative hierarchical clustering (AHC), which performs the clustering of previously learned neural embeddings. While the clustering approach attempts to identify speaker clusters, the AHC algorithm does not involve any further learning. In this paper, we propose a novel algorithm for hierarchical clustering which combines speaker clustering with a representation learning framework. The proposed approach is based on principles of self-supervised learning, where the self-supervision is derived from the clustering algorithm. The representation learning network is trained with a regularized triplet loss using the clustering solution at the current step, while the clustering algorithm uses the deep embeddings from the representation learning step. By combining the self-supervision based representation learning with the clustering algorithm, we show that the proposed algorithm improves significantly (29% relative improvement) over the AHC algorithm with cosine similarity for a speaker diarization task on the CALLHOME dataset. In addition, the proposed approach also improves over the state-of-the-art system with a PLDA affinity matrix, with a 10% relative improvement in DER.

#17 Spot the Conversation: Speaker Diarisation in the Wild

Authors: Joon Son Chung ; Jaesung Huh ; Arsha Nagrani ; Triantafyllos Afouras ; Andrew Zisserman

The goal of this paper is speaker diarisation of videos collected ‘in the wild’. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from ‘in the wild’ videos, which we will release publicly to the research community. Our dataset features overlapping speech, a large and diverse speaker pool, and challenging background conditions.

#18 Secondary Phonetic Cues in the Production of the Nasal Short-a System in California English

Authors: Georgia Zellou ; Rebecca Scarborough ; Renee Kemp

A production study explored the acoustic characteristics of /æ/ in CVC and CVN words spoken by California speakers who raise /æ/ in pre-nasal contexts. Results reveal that the phonetic realization of the /æ/-/ε/ contrast in these contexts is multidimensional. Raised pre-nasal /æ/ is close in formant space to /ε/, particularly over the second half of the vowel. Yet, systematic differences in the realization of the secondary acoustic features of duration, formant movement, and degree of coarticulatory vowel nasalization keep these vowels phonetically distinct. These findings have implications for systems of vowel contrast and the use of secondary phonetic properties to maintain lexical distinctions.

#19 Acoustic Properties of Strident Fricatives at the Edges: Implications for Consonant Discrimination

Authors: Louis-Marie Lorin ; Lorenzo Maselli ; Léo Varnet ; Maria Giavazzi

Languages tend to license segmental contrasts where they are maximally perceptible, i.e. where more perceptual cues to the contrast are available. For strident fricatives, the most salient cues to the presence of voicing are low-frequency energy concentrations and fricative duration, as voiced fricatives are systematically shorter than voiceless ones. Cross-linguistically, the voicing contrast is more frequently realized word-initially than word-finally, as for obstruents. We investigate the phonetic underpinnings of this asymmetric behavior at the word edges, focusing on the availability of durational cues to the contrast in the two positions. To assess segmental duration, listeners rely on temporal markers, i.e. jumps in acoustic energy which demarcate segmental boundaries, thereby facilitating duration discrimination. We conducted an acoustic analysis of word-initial and word-final strident fricatives in American English. We found that temporal markers are sharper at the left edge of word-initial fricatives than at the right edge of word-final fricatives, in terms of absolute value of the intensity slope, in the high-frequency region. These findings allow us to make predictions about the availability of durational cues to the voicing contrast in the two positions.

#20 Processes and Consequences of Co-Articulation in Mandarin V1N.(C2)V2 Context: Phonology and Phonetics

Author: Mingqiong Luo

It is well known that in Mandarin Chinese (MC) nasal rhymes, non-high vowels /a/ and /e/ undergo Vowel Nasalization and Backness Feature Specification processes to harmonize with the nasal coda in both manner and place of articulation. Specifically, the vowel is specified with the [+front] feature when followed by the /n/ coda and the [+back] feature when followed by /ŋ/. On the other hand, phonetic experiments in recent research have shown that in MC disyllabic words, the nasal coda tends to undergo place assimilation in the V1N.C2V2 context and complete deletion in the V1N.V2 context. These processes raise two questions: first, will V1 in V1N.C2V2 contexts also change its backness feature to harmonize with the assimilated nasal coda? Second, will the duration of V1N be significantly reduced after nasal coda deletion in the V1N.(G)V context? A production experiment and a perception experiment were designed to answer these two questions. Results show that the vowel backness feature of V1 is not re-specified despite the appropriate environment, and the duration of V1N is not reduced after nasal deletion. The phonological consequences of these findings will be discussed.

#21 Voicing Distinction of Obstruents in the Hangzhou Wu Chinese Dialect

Authors: Yang Yue ; Fang Hu

This paper gives an acoustic phonetic description of the obstruents in the Hangzhou Wu Chinese dialect. Based on the data from 8 speakers (4 male and 4 female), obstruents were examined in terms of VOT, silent closure duration, segment duration, and spectral properties such as H1-H2, H1-F1 and H1-F3. Results suggest that VOT cannot differentiate the voiced obstruents from their voiceless counterparts, but the silent closure duration can. There is no voiced aspiration, and breathiness was detected on the vowel following the voiced category of obstruents. An acoustic consequence is that there is no segment for the voiced glottal fricative [ɦ], since it was realized as breathiness on the following vowel. Interestingly, however, syllables with [ɦ] are observed to be longer than their onset-less counterparts.

#22 The Phonology and Phonetics of Kaifeng Mandarin Vowels

Author: Lei Wang

In the present study, we re-analyze the vowel system in Kaifeng Mandarin, adopting a phoneme-based approach. Our analysis deviates from the previous syllable-based analyses in a number of ways. First, we treat the apical vowels [ɿ ʅ] as syllabic approximants and analyze them as allophones of the retroflex approximant /ɻ/. Second, the vowel inventory comprises three sets: monophthongs, diphthongs, and retroflex vowels. The classification of monophthongs and diphthongs is based on the phonological distribution of the coda nasal. That is, monophthongs can be followed by a nasal coda, while diphthongs cannot. This argument introduces two new opening diphthongs /eε ɤʌ/ into the inventory, which have traditionally been described as monophthongs. Our phonological characterization of the vowels in Kaifeng Mandarin is further backed up by acoustic data. It is argued that the present study has gone some way towards enhancing our understanding of Mandarin segmental phonology in general.

#23 Microprosodic Variability in Plosives in German and Austrian German

Authors: Margaret Zellers ; Barbara Schuppler

Fundamental frequency (F0) contours may show slight, microprosodic variations in the vicinity of plosive segments, which may have distinctive patterns relative to the place of articulation and voicing. Similarly, plosive bursts have distinctive characteristics associated with these articulatory features. The current study investigates the degree to which such microprosodic variations arise in two varieties of German, and how the two varieties differ. We find that microprosodic effects indeed arise in F0 as well as burst intensity and Center of Gravity, but that the extent of the variability is different in the two varieties under investigation, with northern German tending towards more variability in the microprosody of plosives than Austrian German. Coarticulatory effects on the burst with the following segment also arise, but also have different features in the two varieties. This evidence is consistent with the possibility that the fortis-lenis contrast is not equally stable in Austrian German and northern German.

#24 Er-Suffixation in Southwestern Mandarin: An EMA and Ultrasound Study

Authors: Jing Huang ; Feng-fan Hsieh ; Yueh-chin Chang

This paper is an articulatory study of the er-suffixation (a.k.a. erhua) in Southwestern Mandarin (SWM), using co-registered EMA and ultrasound. Data from two female speakers in their twenties were analyzed and discussed. Our recording materials contain unsuffixed stems, er-suffixed forms and the rhotic schwa [ɚ], a phonemic vowel in its own right. Results suggest that the er-suffixation in SWM involves suffixing a rhotic schwa [ɚ] to the stem, unlike its counterpart in Beijing and Northeastern Mandarin [5]. Specifically, an entire rime will be replaced with the er-suffix if the nucleus vowel is non-high; only high vocoids will be preserved after the er-suffixation. The “rhoticity” is primarily realized as a bunched tongue shape configuration (i.e. a domed tongue body), while the Tongue Tip gesture plays a more limited role in SWM. A phonological analysis is accordingly proposed for the er-suffixation in SWM.

#25 Electroglottographic-Phonetic Study on Korean Phonation Induced by Tripartite Plosives in Yanbian Korean

Authors: Yinghao Li ; Jinghua Zhang

This paper examined the phonatory features induced by the tripartite plosives in Yanbian Korean, broadly considered a Hamkyungbukdo Korean dialect. Electroglottographic (EGG) and acoustic analysis was applied to five elderly Korean speakers. The results show that fortis-induced phonation is characterized by a more constricted glottis, slower spectral tilt, and higher sub-harmonic-harmonic ratio. Lenis-induced phonation is shown to be breathier, with a smaller Contact Quotient and faster spectral tilt. Most articulatory and acoustic measures for the aspirated are shown to pattern with the lenis; however, sporadic differences between the two indicate that the lenis induces breathier phonation. Diplophonic phonation is argued to be a salient feature of fortis-head syllables in Yanbian Korean. The vocal fold medial compression and adductive tension mechanisms are tentatively argued to be responsible for the production of the fortis. Finally, gender differences are shown to be salient in fortis-induced phonation.