In this paper we introduce a novel dynamic programming algorithm, Information Retrieval-based Dynamic Time Warping (IR-DTW), for finding non-linearly matching subsequences between two time series when matching start and end points are not known a priori. The algorithm is applied here to audio matching within the query-by-example (QbE) spoken term detection (STD) task, although it is applicable to many other problems. The main advantages of the proposed algorithm over similar approaches are twofold. On the one hand, IR-DTW requires a much smaller memory footprint than standard Dynamic Time Warping (DTW) approaches. On the other hand, it allows indexing techniques to be applied to the search collection for increased matching speed, which makes IR-DTW suitable for large-scale implementations. We show through preliminary experimentation on a QbE-STD task that the memory footprint is greatly reduced in comparison to a baseline subsequence-DTW (S-DTW) implementation and that its matching accuracy is much better than that of pure diagonal matching and only slightly worse than that of S-DTW.
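For reference, a minimal sketch of the subsequence-DTW (S-DTW) baseline that the abstract compares against, not of IR-DTW itself; the frame features, the Euclidean local distance, and all names here are illustrative assumptions. The full (n+1) x (m+1) cost matrix it allocates is exactly the memory cost that IR-DTW is designed to avoid.

```python
import numpy as np

def subsequence_dtw(query, search):
    """Best non-linear match of `query` anywhere inside `search`.

    query:  (n, d) array of query feature frames
    search: (m, d) array of search-collection frames
    Returns (end_frame, cost) of the best matching subsequence.
    """
    n, m = len(query), len(search)
    # Pairwise frame distances (Euclidean here; cosine is also common).
    dist = np.linalg.norm(query[:, None, :] - search[None, :, :], axis=-1)
    # Row 0 is zero: a free starting point anywhere along `search`,
    # which is what distinguishes S-DTW from plain DTW.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    end = int(np.argmin(D[n, 1:])) + 1   # free end point anywhere in `search`
    return end - 1, D[n, end]
```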
A factor automaton is an efficient data structure for representing all factors (substrings) of a set of strings (e.g., one given as a finite-state automaton). This notion can be generalized to weighted automata by associating a weight with each factor. In this paper, we consider the problem of computing expected document frequency (DF) and TF-IDF statistics for all substrings seen in a collection of word lattices by means of factor automata. We present an algorithm which transforms an acyclic weighted automaton, e.g. an ASR lattice, into a weighted factor automaton where the path weight of each factor represents the total weight assigned by the input automaton to the set of strings including that factor at least once. We show how this automaton can be used to efficiently construct other types of weighted factor automata representing DF and TF-IDF statistics for all factors seen in a large speech corpus. Compared to the state of the art in computing these statistics from spoken documents, our approach i) generalizes the statistics from single words to contiguous substrings, ii) provides significant gains in average run-time and storage requirements, and iii) constructs efficient inverted index structures for retrieval of such statistics. Experiments on a Turkish data set corroborate our claims.
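As a point of reference for the statistics involved, a toy sketch of factor document frequency on plain strings (the 1-best case); the paper's contribution, computing expected DF over weighted ASR lattices via factor automata, is not attempted here, and all names are illustrative.

```python
import math
from collections import Counter

def factor_df(documents):
    """Document frequency of every factor (contiguous word substring):
    the number of documents containing that factor at least once."""
    df = Counter()
    for doc in documents:
        w = doc.split()
        # A set, so each factor counts at most once per document.
        df.update({tuple(w[i:j])
                   for i in range(len(w))
                   for j in range(i + 1, len(w) + 1)})
    return df

docs = ["a b a", "b c"]          # toy 1-best transcripts
df = factor_df(docs)             # e.g. df[('b',)] == 2
idf = {f: math.log(len(docs) / n) for f, n in df.items()}
```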
We previously proposed a fast spoken term detection method that uses a suffix array data structure for searching large-scale speech documents. The method reduces search time via techniques such as keyword division and iterative lengthening search. In this paper, we propose a statistical method of assigning different threshold values to sub-keywords to further accelerate search. Specifically, the method estimates the numbers of results for keyword searches and then reduces them by adjusting the threshold values assigned to sub-keywords. We also investigate the theoretical condition that must be satisfied by these threshold values. Experiments show that the proposed search method is 10% to 30% faster than previous methods.
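A minimal sketch of the exact-match backbone of suffix-array search, under assumed names; the paper's accelerations (keyword division, estimating result counts, per-sub-keyword thresholds) are not shown.

```python
import bisect

def build_suffix_array(text):
    """All suffix start positions, sorted lexicographically.
    (Naive O(n^2 log n) construction; fine as a sketch, not at scale.)"""
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_hits(text, sa, keyword):
    """Exact-match lookup via binary search over the suffix array; the
    truncated prefixes stay sorted, so bisect brackets the hit range."""
    prefixes = [text[i:i + len(keyword)] for i in sa]
    lo = bisect.bisect_left(prefixes, keyword)
    hi = bisect.bisect_right(prefixes, keyword)
    return hi - lo

text = "abracadabra"
sa = build_suffix_array(text)
print(count_hits(text, sa, "abra"))   # -> 2
```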
We present design strategies for a keyword spotting (KWS) system that operates in highly degraded channel conditions with very low signal-to-noise ratio levels. We employ a system combination approach, combining the outputs of multiple large vocabulary automatic speech recognition (LVCSR) systems, each of which follows a different design approach targeting three levels of information: front-end signal processing features (standard cepstra-based, noise-robust modulation and multi-layer perceptron features), statistical acoustic models (Gaussian mixture models (GMM) and subspace GMMs) and keyword search strategies (word-based and phone-based). We also build keyword awareness into the system at two levels: in the LVCSR language models, by assigning higher weights to n-grams containing keywords, and in LVCSR search, by using a relaxed pruning threshold for keywords. The LVCSR system outputs are represented as lattice-based unigram indices whose scores are fused by a logistic-regression-based classifier to produce the final system combination output. We present the performance of our system in the phase II evaluations of DARPA's Robust Automatic Transcription of Speech (RATS) program on both Levantine Arabic and Farsi conversational speech corpora.
The combination of several heterogeneous systems is known to provide remarkable performance improvements in verification and detection tasks. In Spoken Term Detection (STD), two important issues arise: (1) how to define a common set of detected candidates, and (2) how to combine system scores to produce a single score per candidate. In this paper, a discriminative calibration/fusion approach commonly applied in speaker and language recognition is adopted for STD. Under this approach, we first propose several heuristics to hypothesize scores for systems that do not detect a given candidate. In this way, the original problem of several unaligned detection candidates is converted into a verification task. As for other verification tasks, system weights and offsets are then estimated through linear logistic regression. As a result, the combined scores are well calibrated, and the detection threshold is automatically given by application parameters (priors and costs). The proposed method not only offers an elegant solution for the problem of fusion and calibration of multiple detectors, but also provides consistent improvements over a baseline approach based on majority voting, according to experiments on the MediaEval 2012 Spoken Web Search (SWS) task involving 8 heterogeneous systems developed at two different laboratories.
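A minimal sketch of the calibration/fusion step, assuming scikit-learn and a simple floor-value heuristic for missing detections (the paper proposes several such heuristics); the data layout and numbers are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical layout: one row per detection candidate, one column per
# system; np.nan marks systems that did not detect the candidate.
scores = np.array([[ 2.1,   1.7, np.nan],
                   [-0.4, np.nan,  0.3],
                   [ 3.0,   2.2,   2.5]])
labels = np.array([1, 0, 1])          # 1 = true term occurrence

# Heuristic: hypothesize a floor score for missing detections, turning
# the unaligned-candidates problem into a standard verification task.
floor = np.nanmin(scores) - 1.0
scores = np.where(np.isnan(scores), floor, scores)

# Linear logistic regression learns one weight per system plus an offset,
# yielding calibrated log-odds; the decision threshold then follows from
# the application's priors and costs (the Bayes threshold).
clf = LogisticRegression().fit(scores, labels)
fused = clf.decision_function(scores)  # calibrated fused scores
```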
Triphone acoustic models are often used as subword models for detecting out-of-vocabulary query terms in Spoken Term Detection (STD) systems. Our preliminary experiments revealed that the training data for a large portion of the approximately 8,000 triphone models are insufficient. Assuming that such poorly trained models degrade STD performance, this paper proposes intensive triphone models constructed by integrating low-occurrence triphone models into high-occurrence ones. Experiments conducted on an actual lecture speech corpus showed that the proposed method improves STD performance for both triphones and demiphones, demonstrating its effectiveness.
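A sketch of the bookkeeping such an integration might involve, under the assumption (ours, not necessarily the paper's) that rare triphones back off to the best-trained model sharing the same center phone; min_count is a hypothetical threshold.

```python
def intensify(triphone_counts, min_count=100):
    """Map each triphone to the model it should use: itself if it has
    enough training occurrences, otherwise the highest-count triphone
    with the same center phone (assumed backoff criterion)."""
    best_by_center = {}
    for tri, n in triphone_counts.items():
        center = tri[1]                       # tri = (left, center, right)
        if n >= best_by_center.get(center, (None, -1))[1]:
            best_by_center[center] = (tri, n)
    return {tri: tri if n >= min_count else best_by_center[tri[1]][0]
            for tri, n in triphone_counts.items()}

counts = {('a', 'k', 'i'): 900, ('o', 'k', 'u'): 12, ('a', 't', 'o'): 450}
mapping = intensify(counts)   # ('o','k','u') is merged into ('a','k','i')
```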
This paper focuses on an early instance of the concept of voice onset time (VOT), which is traditionally associated with the work of Leigh Lisker and Arthur Abramson in the 1960s. Evidence is presented here that the idea behind VOT, if not the name, is much older than commonly thought. A publication by an Armenian scientist who worked at Abbé Rousselot's experimental phonetics laboratory at the Sorbonne in the late 19th century is discussed. That researcher studied several varieties of Armenian and categorized them in terms of what is nowadays called VOT. His paper can be regarded as an exemplar of "modern" phonetic thinking long before the advent of digital sound processing.
This acoustic study explored dialect effects on realization of nuclear pitch accents in three regional varieties of American English spoken in central Ohio, southeastern Wisconsin and western North Carolina. Fundamental frequency (F0) change from vowel onset to offset in the most prominent syllable in a sentence was examined along four parameters: maximum F0 change, relative location of F0 maximum, F0 offset and F0 fall from maximum to offset. A robust finding was that the F0 contours in the Southern (North Carolina) variants were significantly distinct from the two Midwestern varieties whose contours did not differ significantly from one another. The Southern vowels had an earlier F0 rise, a greater F0 fall and a lower F0 offset than either Ohio or Wisconsin vowels. There was a sharper F0 drop preceding a voiceless than a voiced syllable coda. No significant dialect-related differences were found for flat F0 contours in unstressed vowels, which were also examined in the study. This study contributes the finding that dynamic variations in pitch are greater for vowels which also exhibit a greater amount of spectral dynamics. The interaction of these two sets of cues contributes to the melodic component associated with a specific regional accent.
This study investigates the possibility of recovering the voice strength, i.e. the sound level produced by the speaker, from the recorded signal. The dataset consists of a set of isolated vowels (720 tokens) recorded in a situation where two interlocutors interacted orally at a distance of between 0.40 and 6 meters, in a furnished room. For each token, voice strength is measured at the intensity peak, and several sets of acoustic cues are extracted from the signal spectrum, after frequency weighting and intensity normalization. In the first phase, the tokens are grouped into categories of increasing voice strength. Discriminant Analysis produces a classifier which takes into account all the signal dimensions implicitly coded in the set of cues. In the second phase, the cues of a new token are given to the classifier, which in turn produces its distances to the groups, providing the basis for estimating the unknown voice strength. The quality of the process is evaluated either in self-consistency mode or by cross-validation, i.e. by comparing the estimate with the value initially measured on the same token. The statistical margin of error is quite low, of the order of 3 dB, depending on the sets of cues used.
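A compact sketch of the two-phase procedure using scikit-learn's discriminant analysis; the cue vectors, category edges and bin centres are invented stand-ins, and the posterior-weighted estimate is one plausible reading of how the group distances yield a voice-strength estimate.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical stand-ins: `cues` holds per-token spectral cue vectors,
# `spl` the voice strength in dB measured at each token's intensity peak.
rng = np.random.default_rng(0)
cues = rng.normal(size=(720, 20))
spl = rng.uniform(55.0, 90.0, size=720)

# Phase 1: group tokens into increasing voice-strength categories and
# fit a discriminant-analysis classifier on the cue vectors.
edges = np.arange(60.0, 90.0, 5.0)
bins = np.digitize(spl, edges)
lda = LinearDiscriminantAnalysis().fit(cues, bins)

# Phase 2: a new token's class posteriors provide the basis for the
# estimate, e.g. as a posterior-weighted mean of assumed bin centres.
centres = 57.5 + 5.0 * lda.classes_
estimate_db = lda.predict_proba(cues[:1]) @ centres
```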
The increased vocal effort associated with the Lombard reflex produces speech that is perceived as louder and judged to be more intelligible in noise than normal speech. Previous work illustrates that, on average, Lombard increases in loudness result from boosting spectral energy in a frequency band spanning the range of formants F1-F3, particularly for voiced speech. Observing additionally that increases in loudness across spoken sentences are spectro-temporally localized, the goal of this work is to further isolate these regions of maximal loudness by linking them to specific formant trends, explicitly considering here the vowel formant separation. For both normal and Lombard speech, this work illustrates that, as loudness increases in frequency bands containing formants (e.g. F1-F2 or F2-F3), the observed separation between formant frequencies decreases. From a production standpoint, these results seem to highlight a physiological trait associated with how humans increase the loudness of their speech, namely moving vocal tract resonances closer together. Particularly, for Lombard speech, this phenomenon is exaggerated: that is, the Lombard speech is louder and formants in corresponding spectro-temporal regions are even closer together.
A method to compute the acoustic characteristics of a simplified three-dimensional vocal-tract model with asymmetric wall impedances is presented. The acoustic field is represented in terms of both plane waves and higher-order modes in tubes. This model is constructed using a connected structure of rectangular acoustic tubes, and can parametrically represent acoustic characteristics in higher frequencies where the assumption of plane wave propagation does not hold. The propagation constants of the plane waves and the higher-order modes are calculated taking account of the asymmetric distribution of the wall impedances which can be specified as different values on four sides of each rectangular tube. The frequency characteristics of the propagation constants and the transfer characteristics of multiple section models are presented.
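For orientation, the plane-wave part of such a model is typically chained section by section with the standard transmission-line matrix shown below (a textbook building block, not the paper's full formulation); the paper's contribution lies in the higher-order modes and in computing the propagation constant from the asymmetric wall impedances.

```latex
% Chain matrix of one uniform tube section of length \ell, with
% characteristic impedance Z_c and propagation constant \gamma:
\begin{pmatrix} P_{\mathrm{in}} \\ U_{\mathrm{in}} \end{pmatrix}
=
\begin{pmatrix}
\cosh\gamma\ell & Z_c\,\sinh\gamma\ell \\
Z_c^{-1}\sinh\gamma\ell & \cosh\gamma\ell
\end{pmatrix}
\begin{pmatrix} P_{\mathrm{out}} \\ U_{\mathrm{out}} \end{pmatrix}
```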
This study presents a new glottal inverse filtering (GIF) technique based on closed phase analysis over multiple fundamental periods. The proposed Quasi Closed Phase Analysis (QCP) method utilizes Weighted Linear Prediction (WLP) with a specific Attenuated Main Excitation (AME) weighting function that attenuates the contribution of the glottal source in the optimization of the linear prediction model. This enables the use of the autocorrelation criterion in linear prediction instead of the conventional covariance criterion used in closed phase analysis. The proposed method was compared to previously developed methods using a synthetic vowel database created with a physical modeling approach. The obtained objective measures show that the proposed method improves GIF performance for both low- and high-pitched vowels.
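A minimal sketch of weighted linear prediction with an arbitrary weighting function; in QCP the weight would be the AME function, which attenuates samples around the glottal excitation instants. This shows the generic WLP normal equations, not the paper's exact autocorrelation-based formulation.

```python
import numpy as np

def weighted_lp(signal, weight, order=10):
    """Weighted linear prediction: minimise sum_n w[n] * e[n]^2 with
    e[n] = s[n] - sum_k a[k] s[n-k]. Down-weighting samples near glottal
    excitation makes the filter reflect mainly closed-phase samples."""
    R = np.zeros((order, order))
    r = np.zeros(order)
    for t in range(order, len(signal)):
        past = signal[t - order:t][::-1]        # s[t-1] ... s[t-order]
        R += weight[t] * np.outer(past, past)
        r += weight[t] * past * signal[t]
    return np.linalg.solve(R, r)                # predictor coefficients a[k]
```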
German knows two plateau-based phrase-final intonation contours: the high level plateau of the continuation rise and the descending plateau sequence of the calling contour. They occur within a narrow scaling range of only a few semitones. The paper presents production and perception evidence for a third plateau-based phrase-final intonation contour inside this narrow scaling range. The new plateau contour shows an F0 decrease of between 1 and 3 st (in the form of a slightly declining plateau or a descending plateau sequence), involves additional lengthening of the vowels underneath the plateau, and occurs when resistance is futile, i.e. when speakers signal that they finally, but reluctantly, give in to a demand of the dialogue partner. Phonological implications are briefly outlined.
The presented study concerns the influence of the syllabic structure on perceived prominence. We examined how gaps in the F0 contour due to unvoiced consonants affect prominence perception, given that such gaps can either be filled or blinded out by listeners. For this purpose we created a stimulus set of real disyllabic words which differed in the quantity of the vowel of the accented syllable nucleus and the types of subsequent intervocalic consonant(s). Results include, inter alia, that stimuli with unvoiced gaps in the F0 contour are indeed perceived as less prominent. The prominence reduction is smaller for monotonous stimuli than for stimuli with F0 excursions across the accented syllable. Moreover, in combination with F0 excursions, it also mattered whether F0 had to be interpolated or extrapolated, and whether or not the gap included a fricative sound. The results support both the filling-in and blinding-out of F0 gaps, which fits in well with earlier experiments on the production and perception of pitch.
Rapid Prosody Transcription (RPT) was used to investigate listeners' perceptions of prosody in reading by native and non-native English speakers. RPT offers a language-independent tool to access listeners' holistic understanding of prosody. Listeners hear an audio recording of speech while following along on an orthographic, unpunctuated transcript of the recording. They indicate their perception of phrasal boundaries or prominent words by marking them on the transcript in real time. Our listeners showed higher agreement for boundary-marking in the native speakers' reading than the non-natives'. Listeners marked more boundaries in the non-natives' reading, likely because the non-natives paused more often, although listeners partially compensated by not marking boundaries as often when non-natives made short pauses. For prominence, rates of agreement were higher for the non-natives. This may be due to listeners' marking fewer prominences in the non-natives' reading, meaning that they agreed on the absence of prominent words. Compared to acoustic analysis, studying listener reactions provides more insight into what aspects of non-native prosody are most salient. This may be useful in guiding learners to the most effective ways to improve their prosody.
This study aims to investigate native speakers' perception of prosodic variation in Chinese utterances. It is known that timing is crucial for intelligibility in English, Japanese and other accent languages [9, 13, 14, 16]. As a tone language, Chinese relies more heavily on pitch than other languages do, both for distinguishing the meanings of segmentally identical words and for expressing intonation. Pitch is therefore expected to be the most important prosodic factor in naturalness judgments of Chinese. However, no empirical data have been presented to support this view to date. In general, listeners are more sensitive to deviations in timing than in pitch, and pitch change also triggers slight changes in duration. To examine whether the significance of timing for naturalness is universal across languages and applies to tone languages as well, the relative importance of timing and pitch in Chinese was investigated using L2 Chinese speech. The results indicated that Chinese native listeners do notice deviations in timing and regard such speech as accented.
The fundamental frequency of a complex sound modulates its perceived duration: higher-pitched sounds are perceived as longer than lower-pitched sounds, as shown by several independent studies since 1973. In this paper, the effect of language background is studied: native speakers of Finnish and German participated in a two-alternative forced choice duration discrimination experiment in which the duration and frequency of two sounds were randomly varied. Overall duration discrimination sensitivity was similar for both groups, but the speakers of Finnish were influenced more by pitch in their judgements. In addition, the difference in the two sounds' pitch periods explained the response data better than the difference in pitch frequencies or the pitch interval. As the Finnish quantity system is known to employ both duration and pitch cues, the present results suggest that listeners are shaped by their language environment even when the task is purely non-linguistic.
In many cases of turn transition in conversation, a new speaker may respond to phonetic cues from the end of the prior turn, including variation in prosodic features such as pitch and final lengthening. Although consistent pitch and lengthening features are well-established for some languages at potential points of turn transition, this is not necessarily the case for Swedish. The current study uses a two-alternative forced choice task to investigate how variation in pitch contour and lengthening at the ends of syntactically complete turns can influence listeners' expectations of turn hold or turn transition. Both lengthening and pitch contour features were found to influence listeners' judgments about whether turn transition would occur, with shorter length and higher final pitch peaks associated with turn hold. Furthermore, listeners were more certain about their judgments when asked about turn-hold rather than turn-change, suggesting an imbalance in the strength of turn-hold versus turn-transition cues.
Glottalization is often associated with low pitch in intonation languages, but evidence from many languages indicates that this is not an obligatory association. We asked speakers of German, English and Swedish to compare glottalized stimuli with several pitch contour alternatives in an AXB listening test. Although the low F0 in the glottalized stimuli tended to be perceived as most similar to falling pitch contours, this was not always the case, indicating that pitch perception in glottalization cannot be predicted by F0 alone. We also found evidence for cross-linguistic differences in the degree of flexibility of pitch judgments in glottalized stretches of speech.
This paper presents the results of a pitch accent categorisation simulation which attempts to classify L*H and H*L accents using a psychologically motivated exemplar-theoretic model of categorisation. Pitch accents are represented in terms of six linguistically meaningful parameters describing their shape. No additional information is employed in the categorisation process. The results indicate that these accents can be successfully categorised, via exemplar-based comparison, using a limited number of purely tonal features.
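A minimal GCM-style sketch of exemplar-based categorisation over the six shape parameters; the exponential similarity kernel and the parameter c are illustrative assumptions rather than the paper's exact model.

```python
import numpy as np

def exemplar_categorise(x, exemplars, labels, c=1.0):
    """Exemplar-based choice: a new token is compared to every stored
    exemplar, and summed similarity per category gives the choice
    probabilities. `x` and `exemplars` are vectors of (here) six
    accent-shape parameters; labels are e.g. 'L*H' or 'H*L'."""
    d = np.linalg.norm(exemplars - x, axis=1)   # distance to each exemplar
    sim = np.exp(-c * d)                        # exponential similarity
    cats = np.unique(labels)
    evidence = np.array([sim[labels == k].sum() for k in cats])
    return cats[np.argmax(evidence)], evidence / evidence.sum()
```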
In two speeded acceptability experiments we tested which combination of prenuclear accent, nuclear accent and F0-interpolation between them is best suited to signal a double contrast in German (i.e., a contrastive topic followed by a contrastive focus). The experimental utterances differed in the prenuclear accent (medial- vs. late-peak, i.e., L+H* vs. L*+H), the nuclear accent (early- vs. medial-peak, i.e., H+L* vs. H*) and the F0-interpolation between them (high or dipping). All utterances were judged for their acceptability in a contrastive (Experiment 1) and a non-contrastive context (control Experiment 2). Our results showed that the combination of a late-peak prenuclear accent (L*+H) and an early-peak nuclear accent (H+L*) is best suited to signal a double contrast, independent of the F0-interpolation. The reaction time data also support the view that the F0-interpolation is not necessary for the interpretation of a double contrast.
Previous research has reported stress "deafness" for languages with predictable stress, like French, in contrast to languages with non-predictable stress, like Spanish. The contrastive nature of stress appears to inhibit stress "deafness", but segmental and/or suprasegmental cues may also enhance stress discrimination. In this study we carried out two experiments aiming to investigate stress perception in European Portuguese (EP), a language with non-predictable stress that uses duration and vowel reduction as the main cues to stress. We used nonsense words that differed only in stress location, thus removing vowel reduction as a cue to stress. Experiment 1 was an ABX discrimination task. Experiment 2 was a sequence recall task. In both experiments, the stress contrast condition was compared with a phoneme control condition, in nuclear and post-nuclear position. Results of both experiments strongly suggest a stress "deafness" effect in EP. Despite its variable nature, word stress is hardly perceived by EP native speakers in the absence of vowel reduction. These findings have implications for claims on prosodic-based cross-linguistic perception of word stress in the absence of vowel quality, and for stress "deafness" as a consequence of a predictable stress grammar.
The perception of prosodic prominence is influenced by different sources, such as acoustic cues, linguistic expectations and context. We use a generalized additive model and a random forest to model perceived prominence on a corpus of spoken German. Both models are able to explain over 80% of the variance. While the random forest gives us some insight into the relative importance of the cues, the generalized additive model gives us insight into the interactions between different cues to prominence.
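A sketch of the random-forest half of such an analysis with scikit-learn; the cue matrix, ratings and dimensions are invented stand-ins (out-of-bag R^2 here plays the role of variance explained, and feature_importances_ the role of cue importance). The generalized additive model half would need a GAM library and is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-ins: per-word prominence cues (e.g. F0 range,
# duration, intensity, ...) and perceived-prominence ratings.
rng = np.random.default_rng(0)
cues = rng.normal(size=(1000, 5))
prominence = cues @ np.array([0.8, 0.5, 0.3, 0.1, 0.0]) \
    + rng.normal(0.0, 0.3, size=1000)

rf = RandomForestRegressor(n_estimators=300, oob_score=True)
rf.fit(cues, prominence)
print(rf.oob_score_)            # out-of-bag R^2: share of variance explained
print(rf.feature_importances_)  # relative importance of the cues
```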
The effect of time-compression and -expansion on the perception of speech rate differences is investigated. Natural utterances were compared with modified versions time-scaled to the same duration. A set of ten German sentences was produced by one native speaker at slow and fast speed. In a forced choice discrimination task 15 participants were asked to select the faster one of two versions of the same sentence. In the case of low speech rate, versions that had been slowed down were perceived as slower than the corresponding natural utterances, whereas at high speech rates, stimuli with increased speed were judged as relatively faster. The effect turned out to be stronger for the slow stimuli. These findings suggest that the underlying articulatory effort plays an important role in the perception of speech rate.
Vowels are generally described with static articulatory configurations represented by targets in the acoustic space: typically, formant frequencies in the F1-F2 and F2-F3 planes. Plosive consonants can be described in terms of places of articulation, represented by loci or locus equations in an acoustic plane. But how are a given vowel and a given consonant identified when produced with different acoustic characteristics and in different environments? To what extent do listeners use contextual information? To what extent do they use normalization, and of what kind? These questions lead to studying both vowels and consonants from a dynamic point of view. At this level, what are the respective roles of static targets and dynamic information? Previous studies reveal that synthesized transitions situated on a F1-F2 plane but beyond the values observed in natural speech can be perceived as V1V2: that is, vowel-to-vowel transitions can be characterized simply by the direction and rate of the transitions, even when absolute frequency values are outside of the vowel triangle. The present paper extends the investigation to consonants: it reports new experiments showing that perception of pseudo-V1CV2 can also be obtained with formant transitions situated outside the vowel triangle.