INTERSPEECH.2014 - Others

Total: 298

#1 Direct F0 control of an electrolarynx based on statistical excitation feature prediction and its evaluation through simulation

Authors: Kou Tanaka ; Tomoki Toda ; Graham Neubig ; Sakriani Sakti ; Satoshi Nakamura

An electrolarynx is a device that artificially generates excitation sounds to enable laryngectomees to produce electrolaryngeal (EL) speech. Although proficient laryngectomees can produce quite intelligible EL speech, it sounds very unnatural due to the mechanical excitation produced by the device. To address this issue, we have proposed several EL speech enhancement methods using statistical voice conversion and shown that statistical prediction of excitation parameters, such as F0 patterns, is essential to significantly improve the naturalness of EL speech. In these methods, the original EL speech is recorded with a microphone and the enhanced EL speech is presented from a loudspeaker in real time. This framework is effective for telecommunication, but it is not suitable for face-to-face conversation because both the original EL speech and the enhanced EL speech are presented to listeners. In this paper, we propose direct F0 control of the electrolarynx based on statistical excitation prediction, to develop an EL speech enhancement technique that is also effective for face-to-face conversation. F0 patterns of the excitation signals produced by the electrolarynx are predicted in real time from the EL speech that the laryngectomee produces by articulating excitation signals with the previously predicted F0 values. A simulation experiment is conducted to evaluate the effectiveness of the proposed method. The experimental results demonstrate that the proposed method yields significant improvements in the naturalness of EL speech while keeping its intelligibility sufficiently high.

#2 A target approximation intonation model for Yorùbá TTS

Authors: Daniel R. van Niekerk ; Etienne Barnard

A complete intonation model based on quantitative target approximation is described for Yorùbá text-to-speech (TTS) synthesis. This model is evaluated analytically and perceptually, and compared to a fundamental frequency (F0) model using the standard HTS implementation. Analytical results suggest that the proposed approach models F0 contours more efficiently given typical data constraints in under-resourced environments, and perceptual results comparing the proposed model with HTS are encouraging.

#3 Learning continuous-valued word representations for phrase break prediction

Authors: Anandaswarup Vadapalli ; Kishore Prahallad

Phrase break prediction is the first step in modeling prosody for text-to-speech (TTS) systems. Traditional methods of phrase break prediction have used discrete linguistic representations (such as POS tags, induced POS tags, or word-terminal syllables) for modeling these breaks. However, these discrete representations suffer from a number of issues: the number of discrete classes must be fixed in advance, and such representations do not capture the co-occurrence statistics of the words. As a result, the use of continuous-valued word representations has been proposed in the literature. In this paper, we propose a neural network dictionary learning architecture to induce task-specific continuous-valued word representations, and show that these task-specific features perform better at phrase break prediction than continuous features derived using Latent Semantic Analysis (LSA).

#4 Improving Mandarin prosodic boundary prediction with rich syntactic features

Authors: Hao Che ; Jianhua Tao ; Ya Li

Previous research has indicated that the performance of automatic prosodic boundary labeling for Mandarin benefits from syntactic phrase information. However, the influence of other syntactic features, such as dependency relations, has not yet been studied in depth, especially on large-scale corpora. This paper demonstrates the usefulness of rich syntactic features for Mandarin prosodic boundary prediction. Both syntactic phrase and dependency features are considered in our methods. The experimental results show that rich syntactic features effectively improve the performance of prosodic boundary prediction.

#5 Investigating automatic & human filled pause insertion for speech synthesis

Authors: Rasmus Dall ; Marcus Tomalin ; Mirjam Wester ; William Byrne ; Simon King

Filled pauses are pervasive in conversational speech and have been shown to serve several psychological and structural purposes. Despite this, they are seldom modelled overtly by state-of-the-art speech synthesis systems. This paper seeks to motivate the incorporation of filled pauses into speech synthesis systems by exploring their use in conversational speech, and by comparing the performance of several automatic systems inserting filled pauses into fluent text. Two initial experiments are described which seek to determine whether people's predicted insertion points are consistent with actual practice and/or with each other. The experiments also investigate whether there are 'right' and 'wrong' places to insert filled pauses. The results show good consistency between people's predictions of usage and their actual practice, as well as a perceptual preference for the 'right' placement. The third experiment contrasts the performance of several automatic systems that insert filled pauses into fluent sentences. The best performance (determined by F-score) was achieved through the by-word interpolation of probabilities predicted by recurrent neural network and 4-gram language models. The results offer insights into the use and perception of filled pauses by humans, and how automatic systems can be used to predict their locations.
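The best system above interpolates, word by word, the next-word probabilities of an RNN language model and a 4-gram language model. A minimal sketch of that interpolation step follows; the toy distributions, the weight `LAMBDA` and the decision threshold are illustrative assumptions, not values from the paper.

```python
LAMBDA = 0.5  # interpolation weight for the RNN LM (assumed, not from the paper)

def interpolate(p_rnn, p_ngram, lam=LAMBDA):
    """Linearly combine two next-word distributions, word by word."""
    vocab = set(p_rnn) | set(p_ngram)
    return {w: lam * p_rnn.get(w, 0.0) + (1.0 - lam) * p_ngram.get(w, 0.0)
            for w in vocab}

# Toy next-word distributions after some context, e.g. "I think":
p_rnn = {"uh": 0.30, "that": 0.50, "we": 0.20}
p_ngram = {"uh": 0.10, "that": 0.60, "we": 0.30}

p_mix = interpolate(p_rnn, p_ngram)
# A filled pause is inserted when its interpolated probability exceeds
# a decision threshold (0.15 here is an arbitrary choice).
insert_fp = p_mix["uh"] > 0.15
```

Because both inputs are proper distributions over the same vocabulary, the interpolated distribution still sums to one for any weight in [0, 1].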

#6 The effect of filled pauses and speaking rate on speech comprehension in natural, vocoded and synthetic speech

Authors: Rasmus Dall ; Mirjam Wester ; Martin Corley

It has been shown that in natural speech filled pauses can be beneficial to a listener. In this paper, we attempt to discover whether listeners react in a similar way to filled pauses in synthetic and vocoded speech compared to natural speech. We present two experiments focusing on reaction time to a target word. In the first, we replicate earlier work in natural speech, namely that listeners respond faster to a target word following a filled pause than following a silent pause. This is replicated in vocoded but not in synthetic speech. Our second experiment investigates the effect of speaking rate on reaction times as this was potentially a confounding factor in the first experiment. Evidence suggests that slower speech rates lead to slower reaction times in synthetic and in natural speech. Moreover, in synthetic speech the response to a target word after a filled pause is slower than after a silent pause. This finding, combined with an overall slower reaction time, demonstrates a shortfall in current synthesis techniques. Remedying this could help make synthesis less demanding and more pleasant for the listener, and reaction time experiments could thus provide a measure of improvement in synthesis techniques.

#7 Introducing i-vectors for joint anti-spoofing and speaker verification

Authors: Elie Khoury ; Tomi Kinnunen ; Aleksandr Sizov ; Zhizheng Wu ; Sébastien Marcel

Any biometric recognizer is vulnerable to direct spoofing attacks, and automatic speaker verification (ASV) is no exception: replay, synthesis and conversion attacks all provoke false acceptances unless countermeasures are used. We focus on voice conversion (VC) attacks. Most existing countermeasures use full knowledge of a particular VC system to detect spoofing. We study a potentially more universal approach from a generative modeling perspective. Specifically, we adopt the standard i-vector representation and a probabilistic linear discriminant analysis (PLDA) back-end for the joint operation of a spoofing attack detector and an ASV system. As a proof of concept, we study a vocoder-mismatched ASV and VC attack detection approach on the NIST 2006 speaker recognition evaluation corpus. We report the stand-alone accuracy of both the ASV and countermeasure systems, as well as their combination using score fusion and the joint approach. The method holds promise.
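The paper's back-end is PLDA; as a simpler illustration of how i-vectors are compared, here is cosine-similarity scoring, a common i-vector baseline back-end. The 4-dimensional vectors are toy values (real i-vectors are typically a few hundred dimensions), not data from the paper.

```python
import math

def cosine_score(enroll, test):
    """Cosine similarity between two i-vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(enroll, test))
    na = math.sqrt(sum(a * a for a in enroll))
    nb = math.sqrt(sum(b * b for b in test))
    return dot / (na * nb)

# Toy 4-dimensional "i-vectors" for one target and two trial utterances.
target = [0.9, 0.1, 0.3, 0.2]
same_spk = [0.8, 0.2, 0.25, 0.3]
other_spk = [-0.5, 0.7, -0.1, 0.4]

score_same = cosine_score(target, same_spk)
score_diff = cosine_score(target, other_spk)
```

A verification decision is then a threshold on the score; PLDA replaces this geometric score with a likelihood ratio under a learned model of speaker and channel variability.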

#8 Random projections for large-scale speaker search

Authors: Ryan Leary ; Walter Andrews

This paper describes a system for indexing acoustic feature vectors for large-scale speaker search using random projections. Given one or more target feature vectors, large-scale speaker search returns similar vectors (in a nearest-neighbors fashion) in sublinear time. The speaker feature space consists of i-vectors derived from Gaussian mixture model supervectors. The index and search algorithm is derived from locality-sensitive hashing, with novel approaches to neighboring-bin approximation for improving the miss rate at specified false alarm thresholds. The distance metric for determining the similarity between vectors is the cosine distance. This approach reduced the search space by 70% with a minimal increase in miss rate. When combined with further dimensionality reduction, a reduction of the search space by over 90% is also possible. All experiments are based on the NIST SRE 2010 evaluation.
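A sketch of the core indexing idea: random-hyperplane LSH for cosine similarity, where a vector's bucket key is the sign pattern of its projections onto random directions, so nearby vectors (in cosine distance) tend to share buckets. The dimensions, hash length and Gaussian toy data are assumptions, and the paper's neighboring-bin approximation is not shown.

```python
import random

random.seed(0)
DIM, N_BITS = 16, 8  # i-vector dimension and hash length (toy values)

# Each hyperplane is a random Gaussian direction; the hash bit is the
# sign of the projection, which preserves cosine similarity in expectation.
hyperplanes = [[random.gauss(0.0, 1.0) for _ in range(DIM)]
               for _ in range(N_BITS)]

def lsh_key(vec):
    """Map a vector to an N_BITS-bit bucket key."""
    return tuple(int(sum(h * v for h, v in zip(hp, vec)) >= 0.0)
                 for hp in hyperplanes)

# Build an index of toy speaker vectors, then search only within the
# query's bucket (sublinear in the number of indexed speakers).
index = {}
for spk_id in range(100):
    vec = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    index.setdefault(lsh_key(vec), []).append(spk_id)

query = [random.gauss(0.0, 1.0) for _ in range(DIM)]
candidates = index.get(lsh_key(query), [])
```

Only the candidates in the matched bucket are then scored exactly with the cosine distance; probing adjacent buckets (the paper's neighboring-bin approximation) trades extra candidates for a lower miss rate.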

#9 Analysis of i-vector framework for speaker identification in TV-shows

Authors: Corinne Fredouille ; Delphine Charlet

Inspired by joint factor analysis, the i-vector framework has become the most popular, state-of-the-art approach for the speaker verification task. Applied mainly within the NIST SRE evaluation campaigns, many studies have been proposed to steadily improve the performance of speaker verification systems. Nevertheless, while the i-vector framework has been used in other speech processing fields such as language recognition, very few studies have been reported for the speaker identification task on TV shows. This work was done in the context of the REPERE challenge, which focuses on the people recognition task in multimodal conditions (audio, video, text) over TV show corpora. Challenge participants are also invited to provide systems for monomodal tasks, such as speaker identification. The application of the i-vector framework is investigated from different points of view: (1) several i-vector based approaches are compared, (2) a specific i-vector extraction protocol is proposed in order to deal with the widely varying amounts of training data across the speaker population, and (3) the joint use of speaker diarization and identification is analyzed. Based on a 533-speaker dictionary, this joint system won the monomodal speaker identification task of the 2014 REPERE challenge.

#10 Boosting bonsai trees for efficient features combination: application to speaker role identification

Authors: Antoine Laurent ; Nathalie Camelin ; Christian Raymond

In this article, we tackle the problem of speaker role detection in broadcast news shows. In the literature, many proposed solutions are based on combining various features from acoustic, lexical and semantic information with a machine learning algorithm. Several previous studies mention the use of boosting over decision stumps to combine these features efficiently. In this work, we propose a modification of this state-of-the-art machine learning algorithm, replacing the weak learner (decision stumps) with small decision trees, denoted bonsai trees. Experiments show that using bonsai trees as weak learners for the boosting algorithm substantially improves both the system error rate and the learning time.
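A minimal AdaBoost sketch with a pluggable weak learner, to illustrate the idea of swapping decision stumps for small trees: only the stump learner is implemented here, but any "bonsai" learner with the same (X, y, w) -> (error, hypothesis) signature could be substituted. The 1-D data and labels are toy values, not role-identification features from the paper.

```python
import math

def stump_learner(X, y, w):
    """Best threshold stump on 1-D inputs under example weights w."""
    best = None
    for thr in sorted(set(X)):
        for sign in (+1, -1):
            pred = [sign if x >= thr else -sign for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    err, thr, sign = best
    return err, lambda x: sign if x >= thr else -sign

def adaboost(X, y, weak_learner, rounds=10):
    """Discrete AdaBoost; labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []  # (alpha, hypothesis) pairs
    for _ in range(rounds):
        err, h = weak_learner(X, y, w)
        err = max(err, 1e-10)  # avoid log(0) on a perfect weak learner
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        # Re-weight: boost the examples the weak learner got wrong.
        w = [wi * math.exp(-alpha * yi * h(xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Toy 1-D task: the "role" is +1 whenever x >= 3.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [-1, -1, -1, 1, 1, 1]
clf = adaboost(X, y, stump_learner)
```

The paper's modification amounts to replacing `stump_learner` with a learner that grows a depth-limited tree over the combined acoustic, lexical and semantic features instead of a single threshold test.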

#11 Identifying contributors in the BBC World Service archive

Authors: Yves Raimond ; Thomas Nixon

In this paper we describe the speaker identification feature of the BBC World Service Archive prototype, an experiment run by BBC R&D to investigate alternative ways of publishing large radio archives. This feature relies on diarization of individual programmes, supervector-based speaker models, crowdsourcing for speaker identities, and a fast distributed index based on Locality Sensitive Hashing techniques to propagate these identities. We also describe how crowdsourced data can be used to continuously evaluate and refine our mapping from speaker models to speaker identities. We believe this experiment is one of the largest of its kind.

#12 Effect of long-term ageing on i-vector speaker verification

Authors: Finnian Kelly ; Rahim Saeidi ; Naomi Harte ; David A. van Leeuwen

Assessing the impact of ageing on biometric systems is an important challenge. In this paper, an i-vector speaker verification framework is used to evaluate the impact of long-term ageing on state-of-the-art speaker verification. Using the Trinity College Dublin Speaker Ageing (TCDSA) database, it is observed that the performance of the i-vector system, in terms of both discrimination and calibration, degrades progressively as the absolute age difference between training and testing samples increases. In the case of male speakers, the equal error rate (EER) increases from 4.61% at an age difference of 0–1 years to 32.74% at an age difference of 51–60 years. The performance of a Gaussian Mixture Model-Universal Background Model (GMM-UBM) system is presented for comparison. It is shown that while the i-vector system outperforms the GMM-UBM system, as the absolute age difference increases, the performance of both degrades at a similar rate. It is concluded that long-term ageing variability is distinct from everyday intersession variability, and therefore must be dealt with via dedicated compensation strategies.
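An EER such as the ones reported above is the operating point where the false-accept and false-reject rates coincide. It can be computed from genuine and impostor score lists by sweeping a threshold; the sketch below does this with toy scores (an assumption, not TCDSA data).

```python
def equal_error_rate(genuine, impostor):
    """EER: the threshold sweep stops at the point where the
    false-accept rate (FAR) and false-reject rate (FRR) are closest;
    the EER is their average there."""
    best = None
    for thr in sorted(set(genuine) | set(impostor)):
        far = sum(s >= thr for s in impostor) / len(impostor)
        frr = sum(s < thr for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0)
    return best[1]

# Toy verification scores: genuine trials score higher on average,
# with some overlap (mimicking degradation under ageing).
genuine = [2.0, 1.5, 1.2, 0.9, 0.4, -0.1]
impostor = [-1.5, -1.0, -0.6, -0.2, 0.5, 1.0]
eer = equal_error_rate(genuine, impostor)
```

With these toy scores the two error rates cross at a threshold of 0.5, giving an EER of 1/3; production implementations interpolate the ROC between thresholds rather than averaging at the closest discrete point.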

#13 Acoustic correlates of phonological status

Authors: Maarten Versteegh ; Amanda Seidl ; Alejandrina Cristia

Languages vary not only in terms of their sound inventory, but also in the phonological status that certain sound distinctions are assigned. For example, while vowel nasality is lexically contrastive (phonemic) in Quebecois French, it is largely determined by the context (allophonic) in American English; the reverse is true for vowel tenseness. If phonetics and phonology interact, a minimal pair of sounds should span a larger acoustic divergence when it is pronounced by speakers for whom the underlying distinction is phonemic compared to allophonic. Near minimal pairs were segmented from a corpus of American English and Quebecois French using a crossed design (since nasality and tenseness have opposite phonological status in the two languages). Pairwise time-aligned divergences between contrasts were calculated on the basis of 7 mainstream speech feature representations and a set of linguistic phonetic measurements. Only carefully selected phonetic measurements revealed the expected cross-over, with larger divergences for English than French tokens of the tenseness contrast, and larger divergences for French than English tokens of the nasality contrast. We conclude that the phonetic effects of phonological status are subtle enough that only linguistically-informed (or supervised) measurements can pick up on them.

#14 Parameterization of the glottal source with the phase plane plot

Authors: Manu Airaksinen ; Paavo Alku

Parameterization of the glottal flow is a process in which the glottal flow is represented in terms of a few numerical values. This study proposes a novel parameterization technique, the phase plane symmetry (PPS) parameter, which utilizes the symmetry properties of the phase plane plot. The phase plane is a way to graphically visualize the glottal source in a 2-dimensional space spanned by two amplitude-domain axes. A correctly normalized phase plane plot also has close ties to the normalized amplitude quotient (NAQ) parameter, and it is shown that the inverse NAQ value is represented as a single point in the phase plane plot. The experiments conducted in this study indicate that PPS is powerful in discriminating between various phonation types and is within the same range of robustness as the NAQ parameter.
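The NAQ parameter referenced above is conventionally computed as NAQ = f_ac / (d_peak * T0), with f_ac the peak-to-peak amplitude of the glottal flow and d_peak the magnitude of the largest negative peak of its derivative (Alku et al., 2002). A sketch on a synthetic half-sine glottal pulse; the pulse shape, sampling rate and F0 are illustrative assumptions, not data from the paper.

```python
import math

def naq(flow, fs, t0):
    """Normalized amplitude quotient of one glottal pulse:
    NAQ = f_ac / (d_peak * T0)."""
    f_ac = max(flow) - min(flow)  # peak-to-peak flow amplitude
    # First-difference approximation of the flow derivative.
    deriv = [(b - a) * fs for a, b in zip(flow, flow[1:])]
    d_peak = -min(deriv)  # magnitude of the largest negative peak
    return f_ac / (d_peak * t0)

# Synthetic half-sine pulse: open phase is half the period.
fs, f0 = 16000, 100.0
t0 = 1.0 / f0
n = int(fs * t0)
n_open = n // 2
flow = [math.sin(math.pi * i / n_open) if i < n_open else 0.0
        for i in range(n)]
value = naq(flow, fs, t0)
```

For this pulse shape the analytic value is Tp / (pi * T0), about 0.16 with a half-period open phase; the abstract's claim is that 1/NAQ appears as a single point in a correctly normalized phase plane plot.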

#15 Transcribing tone — a likelihood-based quantitative evaluation of Chao's tone letters

Author: Phil Rose

The accuracy of the widely used and International Phonetic Association-sanctioned Chao five-point scale of tonal transcription is examined quantitatively. Perceptually transformed acoustic data from two Chinese dialects with complex tone systems are used, and a measure of how well the data conform to the scale is derived from their likelihoods. It is shown that some tones conform well to the model, but others do not, with tonal pitch targets lying equidistant between the Chao integers. It is concluded that the Chao model is probably not an accurate reflection of the distribution of tonal pitch targets.

#16 Intonational phonology and prosodic hierarchy in Malay

Authors: Diyana Hamzah ; James Sneed German

This paper presents original data in support of a new model of intonational phonology for Malay as spoken in Singapore. Building on the Autosegmental-Metrical approach (Beckman & Pierrehumbert, 1986), we propose that intonational variation in Malay can be explained in terms of underlying sequences of abstract tonal units (H and L), which are aligned to the edges and internal syllables of prosodic phrases organized in a hierarchy. Data was drawn from a production experiment (Hamzah, 2012) involving declarative utterances under different focus patterns in a question-answer context, as well as from story-telling interviews and TV interviews. We find evidence for at least three levels of prosodic organization: (i) an accentual phrase which comprises one or more words and bears an L and H tone at its left and right edges, respectively, (ii) an intermediate phrase, which serves as the domain of catathesis, and (iii) an intonational phrase, which may span the entire utterance and bears an additional H or L tone at its right edge. Differences in F0 peak alignment for focused words support the presence of a focus pitch accent. We outline a series of follow-up studies for extending the model further.

#17 Comparing parameterizations of pitch register and its discontinuities at prosodic boundaries for Hungarian

Authors: Uwe D. Reichel ; Katalin Mády

We examined how well prosodic boundary strength can be captured by two declination stylization methods and by four different representations of pitch register. In the stylization proposed by Lieberman et al. (1985), base- and toplines are fitted to the peaks and valleys of the pitch contour, whereas in Reichel & Mády (2013) these lines are fitted to medians below and above certain pitch percentiles. From each stylization, four feature pools were induced, representing different aspects of register discontinuity at word boundaries: discontinuities related to the base-, mid- and topline, as well as to the range between base- and topline. Concerning stylization, the median-based fitting approach turned out to be more robust against declination-line crossing errors and yielded base-, topline- and range-related discontinuity characteristics with higher correlations to perceived boundary strength. Concerning register representation, the base- and topline patterns of the peak/valley fitting approach showed weaker correspondences to boundary strength than the other feature pools. We furthermore trained generalized linear regression models for boundary strength prediction on each feature pool. It turned out that neither the stylization method nor the register representation had a significant influence on the overall good prediction performance.
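A simplified sketch of percentile-based base/topline fitting in the spirit of the median-based stylization: least-squares lines through the F0 samples below the 10th and above the 90th percentile. The paper fits lines to windowed medians rather than raw samples, and the synthetic contour and percentile choices here are toy assumptions.

```python
import math

def percentile(vals, p):
    """Linearly interpolated p-th percentile of a list of numbers."""
    s = sorted(vals)
    idx = (len(s) - 1) * p / 100.0
    lo, hi = int(idx), min(int(idx) + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (idx - lo)

def fit_line(points):
    """Least-squares line y = a*t + b through (t, y) points."""
    n = len(points)
    mt = sum(t for t, _ in points) / n
    my = sum(y for _, y in points) / n
    cov = sum((t - mt) * (y - my) for t, y in points)
    var = sum((t - mt) ** 2 for t, _ in points)
    a = cov / var
    return a, my - a * mt

# Synthetic declining F0 contour (Hz) with superimposed peaks/valleys.
contour = [(i * 0.01, 90.0 - 8.0 * (i * 0.01) + 3.0 * math.sin(0.7 * i))
           for i in range(200)]
f0 = [y for _, y in contour]
base_pts = [(t, y) for t, y in contour if y <= percentile(f0, 10)]
top_pts = [(t, y) for t, y in contour if y >= percentile(f0, 90)]
base_slope, base_icpt = fit_line(base_pts)
top_slope, top_icpt = fit_line(top_pts)
```

Register discontinuity features at a boundary can then be derived from the jumps in these fitted lines between adjacent analysis windows; both slopes come out negative here, reflecting the declination built into the synthetic contour.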

#18 An evaluation of machine learning methods for prominence detection in French

Authors: George Christodoulides ; Mathieu Avanzi

The automatic detection of prosodically prominent syllables is crucial for analysing speech, especially in French where prominence contributes substantially to prosodic grouping and boundary demarcation. In this paper, we compare different machine learning techniques for the automatic detection of prominent syllables, using prosodic features (including pitch, energy, duration and spectral balance) and lexical information. We explore the differences between modelling the detection of prominent syllables as a classification or as a sequence labelling problem, and combinations of the two techniques. We train and evaluate our systems on a corpus of spontaneous French speech, consisting of almost 100 different speakers; the corpus is balanced for speaker age and sex and covers 3 different regional varieties. The result of this study is a novel tool for the automatic annotation of prominent syllables in French.

#19 Investigating the effect of F0 and vocal intensity on harmonic magnitudes: data from high-speed laryngeal videoendoscopy

Authors: Gang Chen ; Soo Jin Park ; Jody Kreiman ; Abeer Alwan

The relative magnitude of the first two harmonics of the voice source (H1*-H2*) is an important measure and is assumed to index changes in vocal quality along a breathy-to-pressed continuum. H1*-H2* is often associated with glottal open quotient (OQ) and glottal pulse skewness (as quantified by speed quotient, SQ), but may also covary with fundamental frequency (F0) and vocal intensity. We examined the relationship between H1*-H2*, F0, and vocal intensity using phonations in which vocal qualities varied continuously in F0 and intensity. Glottal area measures (OQ and SQ) and acoustic measures (F0, intensity, and H1*-H2*) were studied using simultaneously collected laryngeal high-speed videoendoscopy and audio recordings from 9 subjects. Analyses of individual speakers showed that H1*-H2* may sometimes vary as a function of F0 alone, with OQ and SQ remaining rather constant, hypothetically when nonlinear source-filter interaction is strong. Although H1*-H2* is conventionally assumed to decrease with increasing vocal intensity due to a decrease in OQ, the results showed examples where H1*-H2* increased with increasing vocal intensity, hypothetically when the effect of decreasing pulse skewness exceeds the effect of decreasing OQ. In some phonatory modes, the relationship between SQ and H1*-H2* may not be as monotonic as previously assumed.
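For a signal with known F0, the uncorrected H1-H2 can be measured by evaluating the DFT at F0 and 2F0 and taking the level difference in dB (the starred H1*-H2* in the abstract additionally compensates for vocal tract influence, which is not shown here). A sketch on a synthetic two-harmonic signal with assumed amplitudes:

```python
import math

def harmonic_amplitude(signal, fs, freq):
    """Magnitude of the DFT of `signal` at `freq` Hz; the 2/N scaling
    recovers the amplitude of a pure sinusoidal component."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq * i / fs)
             for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * i / fs)
             for i, s in enumerate(signal))
    return 2.0 * math.sqrt(re * re + im * im) / n

fs, f0 = 16000, 200.0
n = int(fs / f0) * 20  # an integer number of periods avoids leakage
# Synthetic source-like signal: H1 amplitude 1.0, H2 amplitude 0.5.
sig = [1.0 * math.sin(2 * math.pi * f0 * i / fs)
       + 0.5 * math.sin(2 * math.pi * 2 * f0 * i / fs)
       for i in range(n)]
h1 = harmonic_amplitude(sig, fs, f0)
h2 = harmonic_amplitude(sig, fs, 2 * f0)
h1_h2_db = 20.0 * math.log10(h1 / h2)  # ~6.02 dB for a 2:1 amplitude ratio
```

In practice F0 must first be estimated, the analysis is windowed, and the harmonic magnitudes are corrected for formant boosting before the starred measure is reported.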

#20 Adapting prosodic chunking algorithm and synthesis system to specific style: the case of dictation

Authors: Elisabeth Delais-Roussarie ; Damien Lolive ; Hiyon Yoo ; Nelly Barbot ; Olivier Rosec

In this paper, we present an approach that allows a TTS system to dictate texts to primary school pupils while conforming to the prosodic features of this speaking style. The approach relies on a prosodic preprocessing module, which avoids developing a specific system for such a limited task. The proposal is based on two distinct elements: (i) the results of a preliminary evaluation that gathered feedback from potential users; (ii) a corpus study of 10 dictations annotated or uttered by 13 teachers or speech therapists (10 and 3, respectively). The preliminary evaluation focused on three points: the accuracy of the segmentation procedure, the size of the automatically calculated chunks, and the intelligibility of the synthesized voice. It showed that the chunks were judged too long and the speaking rate too fast. We therefore worked on these two issues while analyzing the collected data and comparing the observed realizations with the output of the speech synthesis system and the chunking algorithm. The results of the analysis lead us to propose a module that provides, for this speaking style, an enriched text that the synthesizer can use to constrain unit selection and prosodic realization.

#21 The articulation of lexical and post-lexical palatalization in Korean

Author: Jae-Hyun Sung

Palatalization in Korean is of two types — lexical palatalization governed by language-specific phonological rules, and post-lexical palatalization that appears to be purely phonetic. While lexical palatalization only occurs when a morpheme boundary intervenes between a target consonant and a palatalization trigger, post-lexical palatalization occurs irrespective of the presence of a morpheme boundary. This study investigates whether these two types of palatalization and different morphological structures of words manifest as distinct tongue gestures using ultrasound imaging of 4 native speakers of Korean. Comparison of the ultrasound tongue contours shows that the gestural distinction between lexical and post-lexical palatalization may not be the same across individual speakers. Furthermore, the effects of morpheme boundaries are not uniform across different coronal consonants and speakers in terms of tongue gestures. The findings from this study provide further empirical evidence for the role of morphological structures in coarticulation, and are in line with mounting evidence for speaker-specific variability in speech production.

#22 Articulation and neutralization: a preliminary study of lenition in Scottish Gaelic

Authors: Diana Archangeli ; Samuel Johnston ; Jae-Hyun Sung ; Muriel Fisher ; Michael Hammond ; Andrew Carnie

Initial Consonant Mutation in Scottish Gaelic is considered to be morphological, somewhat idiosyncratic, and neutralizing; that is, it merges either a mutated sound with some underlying sound, or two mutated sounds with each other. This study explores articulation in one class of mutation, called Lenition (also Aspiration), asking whether these sounds are articulated in the same fashion or not. Comparison of relevant ultrasound images collected from 3 native speakers of Scottish Gaelic shows that speakers maintain distinctions between True Lenition and False Lenition, suggesting incomplete neutralization. Furthermore, when Lenition of two distinct sounds converges on the same target, subjects again keep the two articulations distinct. These results are consistent with a phonological model that distinguishes between surface forms corresponding to different sources, showing very little complete articulatory neutralization.

#23 Nasality in speech and its contribution to speaker individuality

Authors: Kanae Amino ; Hisanori Makinae ; Tatsuya Kitamura

The term nasality refers to the timbre of the nasal phonemes; it is also used to describe the quality of sound that characterises some speakers. In this paper, we propose to classify nasality in natural speech into four types: phonemic nasality, nasality in assimilation, incidental nasality in the production of voiced plosives, and nasality associated with speaker individuality. Speech sounds recorded separately for oral and nasal outputs were analysed, and the four types of nasality were observed individually. We then conducted an experiment to investigate the relationship between nasality in running speech and the perception of speaker similarity. The results revealed that listeners rated speaker similarity by exploiting phonemic nasality when it was present in the utterance, and also used speaker-related nasality regardless of the presence of phonemic nasals.

#24 Is speech rhythm an intrinsic property of language?

Authors: Jason Brown ; Eden Matene

Different languages have traditionally been classified into different rhythm types. Most studies of rhythm have either implicitly or explicitly accepted that rhythm is an inherent property of a language. This study aims to determine whether rhythm is an intrinsic property of languages, or an epiphenomenal byproduct of the phonotactic structures of a given stimulus. The question that this project addresses is to what extent the phonological properties of a language (for instance, whether it has consonant clusters, makes use of contrastive tone, has complex syllables, or exhibits vowel reduction) can be correlated with rhythmic categories, and whether these properties can be linked to the kind of rhythmic profile a language fits into.

#25 Where /ar/ the /r/s in standard Austrian German?

Authors: Anke Jackschina ; Barbara Schuppler ; Rudolf Muhr

The present paper investigates the conditions under which different realizations of /R/ occur in standard Austrian German. The study is based on 509 word tokens containing the phone sequence /aR/ in coda position, drawn from a corpus of read speech from seven male Austrian radio speakers. Acoustic measurements of the vowel /a/ revealed that F1, F2 and F3 are significant predictors for the realization of /R/ as a trill, as a fricative, or as absent. Moreover, /a/ tends to be longer when /R/ is absent than when it is present. Our analysis of the linguistic conditions for the different realizations of /R/ showed that /R/ is least reduced in stressed syllables and in words read in isolation. Furthermore, we observe that the segmental context significantly affects the realization of /R/. Most importantly, we find significant effects of morphology: /R/ tends to be more reduced when it is part of a grammatical morpheme than when it is part of the stem of a word. These findings inform the further development of models of pronunciation variation for human and automatic speech recognition.