INTERSPEECH.2015 - Others

Total: 424

#1 Learning the speech front-end with raw waveform CLDNNs

Authors: Tara N. Sainath ; Ron J. Weiss ; Andrew Senior ; Kevin W. Wilson ; Oriol Vinyals

Learning an acoustic model directly from the raw waveform has been an active area of research. However, waveform-based models have not yet matched the performance of log-mel trained neural networks. We will show that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech. Specifically, we will show the benefit of the CLDNN, namely the time convolution layer in reducing temporal variations, the frequency convolution layer for preserving locality and reducing frequency variations, as well as the LSTM layers for temporal modeling. In addition, by stacking raw waveform features with log-mel features, we achieve a 3% relative reduction in word error rate.

#2 Architectures for deep neural network based acoustic models defined over windowed speech waveforms

Authors: Mayank Bhargava ; Richard Rose

This paper investigates acoustic models for automatic speech recognition (ASR) using deep neural networks (DNNs) whose input is taken directly from windowed speech waveforms (WSW). After demonstrating the ability of these networks to automatically acquire internal representations that are similar to mel-scale filter-banks, an investigation into efficient DNN architectures for exploiting WSW features is performed. First, a modified bottleneck DNN architecture is investigated to capture dynamic spectrum information that is not well represented in the time domain signal. Second, the redundancies inherent in WSW based DNNs are considered. The performance of acoustic models defined over WSW features is compared to that obtained from acoustic models defined over mel frequency spectrum coefficient (MFSC) features on the Wall Street Journal (WSJ) speech corpus. It is shown that using WSW features results in a 3.0 percent increase in WER relative to that resulting from MFSC features on the WSJ corpus. However, when combined with MFSC features, a reduction in WER of 4.1 percent is obtained with respect to the best evaluated MFSC based DNN acoustic model.

#3 Analysis of CNN-based speech recognition system using raw speech as input

Authors: Dimitri Palaz ; Mathew Magimai-Doss ; Ronan Collobert

Automatic speech recognition systems typically model the relationship between the acoustic speech signal and the phones in two separate steps: feature extraction and classifier training. In our recent works, we have shown that, in the framework of convolutional neural networks (CNN), the relationship between the raw speech signal and the phones can be directly modeled and ASR systems competitive with the standard approach can be built. In this paper, we first analyze and show that, between the first two convolutional layers, the CNN learns (in part) and models the phone-specific spectral envelope information of 2-4 ms speech. Given that, we show that the CNN-based approach yields ASR trends similar to a standard short-term spectral based ASR system under mismatched (noisy) conditions, with the CNN-based approach being more robust.

#4 Bilinear map of filter-bank outputs for DNN-based speech recognition

Authors: Tetsuji Ogawa ; Kenshiro Ueda ; Kouichi Katsurada ; Tetsunori Kobayashi ; Tsuneo Nitta

Filter-bank outputs are extended into tensors to yield precise acoustic features for speech recognition using deep neural networks (DNNs). The filter-bank outputs with temporal contexts form a time-frequency pattern of speech and have been shown to be effective as a feature parameter for DNN-based acoustic models. We attempt to project the filter-bank outputs onto a tensor product space using decorrelation followed by a bilinear map to improve acoustic separability in feature extraction. This extension makes extracting a more precise structure of the time-frequency pattern possible because the bilinear map yields higher-order correlations of features. Experimental comparisons carried out in phoneme recognition demonstrate that the tensor feature provides comparable results to the filter-bank feature, and the fusion of the two features yields an improvement over each feature.

#5 Speech recognition with temporal neural networks

Authors: Payton Lin ; Dau-Cheng Lyu ; Yun-Fan Chang ; Yu Tsao

Raw temporal features were derived from an extracted temporal envelope bank (referred to as “Tbank”). Tbank features were used with deep neural networks (DNNs) to greatly increase the amount of detailed information about the past to be carried forward to help in the interpretation of the future.

#6 Convolutional neural networks for acoustic modeling of raw time signal in LVCSR

Authors: Pavel Golik ; Zoltán Tüske ; Ralf Schlüter ; Hermann Ney

In this paper we continue to investigate how the deep neural network (DNN) based acoustic models for automatic speech recognition can be trained without hand-crafted feature extraction. Previously, we have shown that a simple fully connected feedforward DNN performs surprisingly well when trained directly on the raw time signal. The analysis of the weights revealed that the DNN has learned a kind of short-time time-frequency decomposition of the speech signal. In conventional feature extraction pipelines this is done manually by means of a filter bank that is shared between the neighboring analysis windows. Following this idea, we show that the performance gap between DNNs trained on spliced hand-crafted features and DNNs trained on raw time signal can be strongly reduced by introducing 1D-convolutional layers. Thus, the DNN is forced to learn a short-time filter bank shared over a longer time span. This also allows us to interpret the weights of the second convolutional layer in the same way as 2D patches learned on critical band energies by typical convolutional neural networks. The evaluation is performed on an English LVCSR task. Trained on the raw time signal, the convolutional layers allow us to reduce the WER on the test set from 25.5% to 23.4%, compared to an MFCC based result of 22.1% using fully connected layers.

#7 Stable and unstable intervals as a basic segmentation procedure of the speech signal

Authors: Ulrike Glavitsch ; Lei He ; Volker Dellwo

The concept of acoustically stable and unstable intervals to structure continuous speech is introduced. We present a method to compute stable intervals efficiently and reliably as a bottom-up approach at an early processing stage. We argue that such intervals stand in close relation to the rhythm of speech as they contribute to the overall temporal organization of the speech production process and the acoustic signal (stable intervals = intervals of reduced movement of certain articulators; unstable intervals = intervals of enhanced movement of certain articulators). To test the relationship of these intervals with speech rhythm we investigated the between-speaker variability of stable and unstable intervals in the TEVOID corpus. Results revealed that significant between-speaker variability exists. We hypothesize from our findings that the basic segmentation of speech into stable and unstable intervals is a process that might play a role in human perception and processing of speech.

#8 Polysyllabic shortening and word-final lengthening in English

Authors: Andreas Windmann ; Juraj Šimko ; Petra Wagner

We investigate polysyllabic shortening effects in three prosodic domains, the word, the inter-stress interval (ISI) and the narrow rhythm unit (NRU), in a large corpus of English broadcast speech. Results confirm and extend earlier findings, indicating that these effects are interpretable as artifacts of word-final lengthening. We do, however, find effects compatible with the assumption of eurhythmic principles in speech production.

#9 The acoustics of word stress in English as a function of stress level and speaking style

Authors: Anders Eriksson ; Mattias Heldner

This study of lexical stress in English is part of a series of studies, the goal of which is to describe the acoustics of lexical stress for a number of typologically different languages. When fully developed the methodology should be applicable to any language. The database of recordings so far includes Brazilian Portuguese, English (U.K.), Estonian, German, French, Italian and Swedish. The acoustic parameters examined are f0-level, f0-variation, Duration, and Spectral Emphasis. Values for these parameters, computed for all vowels, are the data upon which the analyses are based. All parameters are tested with respect to their correlation with stress level (primary, secondary, unstressed) and speaking style (wordlist reading, phrase reading, spontaneous speech). For the English data, the most robust results concerning stress level are found for Duration and Spectral Emphasis. f0-level is also significantly correlated but not quite to the same degree. The acoustic effect of phonological secondary stress was significantly different from primary stress only for Duration. In the statistical tests, speaker sex turned out to be significant in most cases. Detailed examination showed, however, that the difference was mainly in the degree to which a given parameter was used, not how it was used to signal lexical stress contrasts.

#10 Pitch accent distribution in German infant-directed speech

Authors: Katharina Zahner ; Muna Pohl ; Bettina Braun

Infant-directed speech exhibits slower speech rate, higher pitch and larger F0 excursions than adult-directed speech. Apart from these phonetic properties established in many languages, little is known about the intonational phonological structure in individual languages, i.e. pitch accents and boundary tones and their frequency distribution. Here, we investigated the intonation of infant-directed speech in German. We extracted all turns from the CHILDES database directed towards infants younger than one year (n=585). Two annotators labeled pitch accents and boundary tones according to the autosegmental-metrical intonation system GToBI. Additionally, the tonal movement surrounding the accentual syllable was analyzed. Main results showed a) that 45% of the words carried a pitch accent, b) that phrases ending in a low tone were most frequent, c) that H* accents were generally more frequent than L* accents, d) that H*, L+H* and L* are the most frequent pitch accent types in IDS, and e) that a pattern consisting of an accentual low-pitched syllable preceded by a low tone and followed by a rise or a high tone constitutes the most frequent single pattern. The analyses reveal that the IDS intonational properties lead to a speech style with many tonal alternations, particularly in the vicinity of accented syllables.

#11 Acoustic correlates of perceived syllable prominence in German

Authors: Hansjörg Mixdorff ; Christian Cossio-Mercado ; Angelika Hönemann ; Jorge Gurlekian ; Diego Evin ; Humberto Torres

This paper explores the relationship between perceived syllable prominence and the acoustic properties of a speech utterance. It is aimed at establishing a link between the linguistic meaning of an utterance in terms of sentence modality and focus and its underlying prosodic features. Applications of such knowledge can be found in computer-based pronunciation training as well as general automatic speech recognition and understanding. Our acoustic analysis confirms earlier results in that focus and sentence mode modify the fundamental frequency contour, syllabic durations and intensity. However, we could not find consistent differences between utterances produced with non-contrastive and contrastive focus, respectively. Only one third of utterances with broad focus were identified as such. Ratings of syllable prominence are strongly correlated with the amplitude Aa of underlying accent commands, syllable duration, maximum intensity and mean harmonics-to-noise ratio.

#12 Cross-modality matching of linguistic and emotional prosody

Authors: Simone Simonetti ; Jeesun Kim ; Chris Davis

Talkers can express different meanings or emotions without changing what is said by changing how it is said (by using both auditory and/or visual speech cues). Typically, cue strength differs between the auditory and visual channels: linguistic prosody (expression) is clearest in audition; emotional prosody is clearest visually. We investigated how well perceivers can match auditory and visual linguistic and emotional prosodic signals. Previous research showed that perceivers can match linguistic visual and auditory prosody reasonably well. The current study extended this by also testing how well auditory and visual spoken emotion expressions could be matched. Participants were presented with a pair of sentences (consisting of the same segmental content) spoken by the same talker and were required to decide whether the pair had the same prosody. Twenty sentences were tested with two types of prosody (emotional vs. linguistic), two talkers, and four matching conditions: auditory-auditory (AA); visual-visual (VV); auditory-visual (AV); and visual-auditory (VA). Linguistic prosody was accurately matched in all conditions. Matching emotional expressions was excellent for VV, poorer for VA, and near chance for AA and AV presentations. These differences are discussed in terms of the relationship between types of auditory and visual cues and task effects.

#13 Pitch scaling as a perceptual cue for questions in German

Author: Jan Michalsky

Recent studies on the intonation of German suggest that the phonetic realization may contribute to the signaling of questions. In a previous production study polar questions, alternative questions and continuous statements were found to differ by a gradual increase in pitch scaling of a phonologically identical final rising contour [1]. Based on similar findings for Dutch, Haan [2] concludes that the meaning signaled by the phonetic realization indicates an attitude rather than a categorical function. This is supported by Chen's [3] perception studies on question intonation in Dutch, Hungarian and Mandarin Chinese as well as early findings for German by Batliner [4]. This paper investigates whether the phonetic realization of intonation in questions signals the categorical pragmatic function of 'interrogativity' or rather a 'questioning' attitude. Additionally, we investigate which phonetic parameter is the decisive cue to this meaning. Three perception studies are reported: a combination of an identification and discrimination task, an imitation task, and a semantic rating task. Results suggest that the phonetic implementation of intonation in German questions signals an attitude rather than a linguistic category and that this function is primarily signaled by the offset of the final rising pitch movement.

#14 Parameterization of prosodic headedness

Authors: Uwe D. Reichel ; Katalin Mády ; Štefan Beňuš

Prosodic headedness generally refers to the location of relevant prosodic events at the left or right end of prosodic constituents. In a bottom-up procedure based on a computational F0 stylization we tested several measures to quantify headedness in parametrical and categorical terms for intonation in the accentual phrase (AP) domain. These measures refer to F0 level and range trends as well as to F0 contour patterns within APs. We tested the suitability of this framework for Hungarian and French known to be left- and right-headed, respectively, and applied it to Slovak whose headedness status is yet less clear. The prosodic differences of Hungarian and French were well captured by several of the proposed parameters, so that from their values for Slovak it can be concluded that Slovak tends to be a left-headed language.

#15 Detection of Mizo tones

Authors: Biswajit Dev Sarma ; Priyankoo Sarmah ; Wendy Lalhminghlui ; S. R. Mahadeva Prasanna

Mizo is a tone language of the Kuki-Chin subfamily of the Tibeto-Burman language family. It is an under-studied language, and few resources are available for it. Moreover, it is a tone language with four different tones, namely high, low, falling and rising. When designing a speech recognition system it becomes imperative that tonal variations are taken into consideration. Hence, a tone detection method for Mizo is designed using quantitative analysis of acoustic features of Mizo tones. Traditional modelling methods require large amounts of data for training. As no such database is available for Mizo, we relied only on slope and height for detecting Mizo tones. In this method, we first convert the pitch values to z-scores. The z-score values are then fitted to a line. The distribution of the variance of the pitch contour, represented by the z-scores, is analysed to classify the tone as high/low or falling/rising. Depending on the slope and height values, the tone is then further classified into high or low and rising or falling, respectively.
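The slope-and-height procedure sketched in the abstract can be illustrated with a few lines of Python. This is a generic sketch, not the authors' implementation: the variance threshold, the z-scoring against speaker-level pitch statistics, and the exact decision rule are assumptions for illustration.

```python
import statistics

def classify_mizo_tone(f0, speaker_mean, speaker_sd, var_threshold=0.15):
    """Classify a syllable's pitch contour as high/low/rising/falling.

    f0 is a list of pitch samples (Hz); speaker_mean and speaker_sd are
    the speaker's overall pitch statistics, used for z-scoring.
    Threshold and decision rule are illustrative assumptions.
    """
    # z-score the contour against the speaker's overall pitch distribution
    z = [(v - speaker_mean) / speaker_sd for v in f0]

    # little variance in the contour -> level tone; height decides high vs low
    if statistics.pvariance(z) < var_threshold:
        return "high" if statistics.fmean(z) > 0 else "low"

    # otherwise a contour tone: the sign of a least-squares line fitted
    # over the sample index decides rising vs falling
    n = len(z)
    x_mean = (n - 1) / 2.0
    slope = (sum((i - x_mean) * zi for i, zi in enumerate(z))
             / sum((i - x_mean) ** 2 for i in range(n)))
    return "rising" if slope > 0 else "falling"
```

For instance, with hypothetical speaker statistics of 200 Hz mean and 30 Hz standard deviation, a flat contour at 230 Hz comes out as "high", while a contour climbing from 170 Hz to 230 Hz comes out as "rising".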

#16 The intonation of echo wh-questions

Authors: Sophie Repp ; Lena Rosin

The acoustic characteristics of German echo questions are explored in a production study. It is shown that there are prosodic differences (F0, duration, intensity) between echo questions signalling a high level of emotional arousal, echo questions signalling that the speaker did not understand the previous utterance, and questions requesting completely new information. The findings are largely compatible with earlier findings on utterances with different levels of emotional arousal, where e.g. a higher F0 signals higher emotional arousal, but they do not confirm expectations with respect to phonological differences formulated on the basis of suggestions in the linguistic literature on echo questions.

#17 Immediately postverbal questions in Urdu

Authors: Farhat Jabeen ; Tina Bögel ; Miriam Butt

This production study investigates the interaction of prosody, word order and information structure with respect to wh-constituents in Urdu. We contrasted immediately preverbal wh-constituents with immediately postverbal ones. The preverbal position is the default focus position in Urdu; the appearance of wh-constituents in the immediately postverbal position within the verbal complex is not well understood. In order to test various possible factors governing the appearance of immediately postverbal wh-constituents, target sentences with wh-constituents in both pre- and postverbal positions were presented in different pragmatic contexts and given to native speakers to pronounce. The results show a clear difference in prosodic realization between the pre- and the postverbal position. The preverbal position is consistent with focus prosody, while the postverbal wh-phrases appear to occur when the verb is in focus.

#18 Prosodic (non-)realisation of broad, narrow and contrastive focus in Hungarian: a production and a perception study

Author: Katalin Mády

In languages with variable focus positions, prominent elements tend to be emphasised by prosodic cues (e.g. English). If a language prefers a given prosodic pattern, i.e. sentence-final nuclear accents, like Spanish, the prosodic realisation of broad focus might not differ from that of narrow and contrastive focus. The relevance of prosodic focus marking was tested in Hungarian, where focus typically appears in front of the finite verb. Prosodic cues such as F0 maximum, F0 peak alignment, segment duration and post-verbal deaccentuation were tested in an experiment with read question and answer sequences. While narrow and contrastive focus triggered post-verbal deaccentuation, none of the gradual measures distinguished focus types consistently from each other. A subsequent perception experiment was conducted in which the same sentences without post-verbal units were to be judged for their naturalness. F0 maximum, F0 peak alignment and accent duration were manipulated. Naturalness scores revealed a sequence narrow > contrastive > broad focus, i.e. a preference for narrow focus contexts compared to contrastive and broad focus ones, while the manipulated prosodic parameters had no effect on the scores. It is concluded that prosodic focus marking in Hungarian is optional and pragmatic rather than grammatical and syntax-related.

#19 F0 discontinuity as a marker of prosodic boundary strength in Lombard speech

Authors: Štefan Beňuš ; Uwe D. Reichel ; Juraj Šimko

Prosodic boundary strength (PBS) refers to the degree of disjuncture between two chunks of speech. It is affected by both linguistic and para-linguistic communicative intentions, thus playing an important role in both speech generation and recognition tasks. Among several PBS signals, we focus in this paper on pitch-related discontinuities in boundaries conveying linguistically meaningful contrasts produced in increasing levels of ambient noise. We compare several measures of local and global pitch reset and use classifiers in an effort to better understand the relationship between the degree of ambient noise and F0 marking of PBS. Our results include a positive effect of some noise on boundary classification, better performance of local than global reset features, and more systematic behavior of F0 falls compared to rises.

#20 Comparing journalistic and spontaneous speech: prosodic and spectral analysis

Authors: Cédric Gendrot ; Martine Adda-Decker ; Yaru Wu

In this study we compare the ESTER corpus of journalistic speech [1] and the NCCF corpus of spontaneous speech [2] in terms of duration, F0 and spectral reduction in productions automatically detected as speech units between pauses. Continuation F0 rises are overall absent in spontaneous speech and speech units reveal a declination slope with less amplitude than in journalistic speech. For both corpora, lengthening starts around 60% of the sequence duration, but significantly less in spontaneous speech. Lengthening in the initial part of the sequence is observed in journalistic speech only. As expected, we measure a faster speech rate in spontaneous speech with shorter vowel durations, implying (though only in part) greater vowel reduction.

#21 Rhythm influences the tonal realisation of focus

Authors: Nadja Schauffler ; Katrin Schweitzer

Several studies suggest that rhythm affects different aspects in speech production and perception. For example, in German, discourse structure is normally marked by pitch accent placement and pitch accent type, however, there is variation that cannot be explained by purely semantic or syntactic factors. Prosody-inherent factors, like rhythm, can contribute to this variation. This becomes evident in prosodically more complex environments: while the prosody of utterances containing one focused constituent is well investigated and rather clear-cut, the prosodic organisation of multiple contrastive foci is less clear. In double-focus constructions, for example, two focused constituents demand prominence, possibly resulting in the realisation of two pitch accents. If these pitch accents are required on adjacent syllables they conflict with rhythmic preferences. We present a sentence reading experiment investigating the tonal realisation of two focused constituents and how their contours affect each other in different rhythmic environments. Specifically, we tested whether a potential pitch accent clash in a sentence with two corrective foci influences the pitch excursion and the absolute peak height of the accented syllables. The results demonstrate that rhythmic constraints affect the organisation of the tonal marking of corrective focus.

#22 Linguistic measures of pitch range in Slavic and Germanic languages

Authors: Bistra Andreeva ; Bernd Möbius ; Grazyna Demenko ; Frank Zimmerer ; Jeanin Jügler

Based on specific linguistic landmarks in the speech signal, this study investigates pitch level and pitch span differences in English, German, Bulgarian and Polish. The analysis is based on 22 speakers per language (11 males and 11 females). Linear mixed models were computed that include various linguistic measures of pitch level and span, revealing characteristic differences across languages and between language groups. Pitch level appeared to have significantly higher values for the female speakers in the Slavic than the Germanic group. The male speakers showed slightly different results, with only the Polish speakers displaying significantly higher mean values for pitch level than the German males. Overall, the results show that the Slavic speakers tend to have a wider pitch span than the German speakers. For one linguistic measure, however, namely the span between the initial peaks and the non-prominent valleys, we find a difference only between the Polish and German speakers. We found a flatter intonation contour in German than in Polish, Bulgarian and English male and female speakers and differences in the frequency of the landmarks between languages. Concerning “speaker liveliness” we found that the speakers from the Slavic group are significantly livelier than the speakers from the Germanic group.

#23 The effect of stress on vowel space in Daxi Hakka Chinese

Authors: Chunan Qiu ; Jie Liang

The present study examined the effect of stress on vowels, specifically duration, formant frequency and acoustic vowel space in CV syllables in Daxi Hakka Chinese. F1 and F2 values were measured at three equidistant time locations. Results show that the absence of stress results in the reduction of vowel duration, which is shortened by 23% in the unstressed condition. The presence of stress affects formants by raising F2 for front vowels, raising F1 for low vowels, and lowering F2 for back vowels. Space areas in the stressed condition are significantly greater than their unstressed counterparts at all three measurement points. The mean space areas are compressed by 17% from the stressed to the unstressed condition. The mean space areas at the 50% point are the largest in the two stress conditions. There is a positive correlation between interspeaker duration and vowel space areas.
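The vowel space area compared above is conventionally computed as the area of the polygon spanned by the vowels' (F1, F2) values. A minimal shoelace-formula sketch follows; it is a generic illustration under that convention, not the paper's actual measurement script, and the formant values in the usage note are hypothetical.

```python
def vowel_space_area(formants):
    """Area of the polygon spanned by vowels in the F1-F2 plane.

    formants is a list of (F1, F2) pairs in Hz, ordered around the
    polygon's perimeter (e.g. corner vowels /i a u/). Computed with
    the shoelace formula.
    """
    n = len(formants)
    area = 0.0
    for i in range(n):
        x1, y1 = formants[i]
        x2, y2 = formants[(i + 1) % n]  # wrap back to the first vertex
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0
```

For example, hypothetical corner values of (300, 2300) for /i/, (700, 1200) for /a/ and (350, 800) for /u/ would give the triangle's area in Hz²; a 17% compression of this number under destressing is the kind of effect the abstract reports.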

#24 Declination, peak height and pitch level in declaratives and questions of South Connaught Irish

Authors: Maria O'Reilly ; Ailbhe Ní Chasaide

As South Connaught Irish typically uses the same (falling) tune type in both questions and declaratives, this paper examines whether sentence mode might be differentiated in this dialect by other aspects of the contour realization, namely declination slope, peak height and pitch level. A set of matched declaratives (DEC), wh- questions (WHQ) and yes/no questions (YNQ) of two phrase lengths (with 2 and 3 accent groups) was analysed. The results indicate that sentence type is reflected in the measured F0 parameters. Compared to declaratives, WHQ exhibit markedly steeper declination slopes and somewhat higher IP-initial peaks, while YNQ raise the pitch level and the IP-initial peaks. Phrase length influences declination slope, but does not appear to affect peak height.

#25 Contextual variation of tones in Mizo

Authors: Priyankoo Sarmah ; Leena Dihingia ; Wendy Lalhminghlui

Mizo is a Tibeto-Burman language belonging to the Kuki-Chin subfamily, and it has four lexical tones, namely high, low, rising and falling. Contextual influence on tones of Mizo is investigated in this study. Trisyllabic Mizo phrases are recorded with the four Mizo tones in H_H, R_R, L_L and F_F contexts. The target word is also recorded in isolation. Both carryover and anticipatory influences were found to various degrees. In the case of low tone targets, preceding tones with a high offset raise the target low tone. In the case of rising tone targets, following tones with a high onset reduce the target to a low (L) tone; the preceding tone does not affect the target tone in this case. Finally, the results of this study are discussed in comparison to contextual variations reported in other tone languages such as Cantonese, Mandarin Chinese and Thai.