Focus in a sentence can be realized prosodically in speech communication, but it has been found difficult for L2 learners to acquire. The present study examines Chinese learners' perception of English prosodic focus, specifically the effects of learners' English proficiency, intonation type, sentence length, and focus location on the perceptual accuracy of English prosodic focus by Chinese EFL learners. Results of two trials in the perception experiment reveal that focus location, intonation type, and English proficiency significantly affected Chinese learners' perceptual accuracy for both single focus and dual focus in English. Focus in statements was perceived more accurately than focus in questions for both single and dual focus. Focus located on sentence-final words in questions was perceived more accurately than focus on non-final words in questions. Learners' English proficiency correlated positively with the accuracy of focus perception, especially for dual focus.
This study tests the division of labor in the meaning conveyed by pitch accents and edge tones in English intonation. In three perception studies, we investigate where the locus of the contrast between an assertive and an inquisitive interpretation resides. In doing so, we also gain insight into the role of potentially meaningful within- and between-category variation in the phonetic implementation of discrete intonational tunes. We find that the pitch accent does not contribute to assertive interpretation. Rather, the distinction between assertive and inquisitive interpretation is cued primarily by the final F0 of the pitch contour regardless of the pitch accent; however, increased overall pitch prominence may trigger a salient focus interpretation that interferes with judging assertiveness.
It has been established that the lack of tonal coarticulation, or pitch reset, is a salient cue for the beginning of a large prosodic domain; however, it remains unclear whether tonal coarticulation can be an informative cue for the end of a prosodic domain. We examined this question with two continuous speech corpora of Mandarin, using both expert and crowd-sourced perceptual annotations. The FPCA model of the holistic tonal contours shows that the carry-over effect of the preceding tone is significantly affected by the strength of the following boundary: stronger carry-over effects are associated with the end of larger prosodic boundaries. Moreover, machine learning classification shows that fine-grained tonal coarticulation patterns are salient cues for predicting larger prosodic boundaries. This result is further validated by crowd-sourced boundary perception ratings from human listeners. The study has important implications for the understanding of prosodic phrasing.
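A minimal sketch, under stated assumptions, of the general analysis pattern described above: PCA over length-normalised f0 contours stands in for FPCA, randomly generated contours and boundary labels stand in for the Mandarin corpus data, and a generic classifier tests whether contour-shape scores predict boundary strength.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: 500 syllable-level f0 contours, each resampled to 30 points,
# and a boundary-strength label for the following juncture
# (0 = no boundary, 1 = minor boundary, 2 = major boundary).
contours = rng.normal(size=(500, 30))
boundary = rng.integers(0, 3, size=500)

# "FPCA" step (approximated here by PCA): keep the first few components of contour shape.
pca = PCA(n_components=4)
scores = pca.fit_transform(contours)

# Classification step: do the contour-shape scores predict boundary strength?
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, scores, boundary, cv=5).mean())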
We present evidence on the alignment of beat gestures and prosodic prominence from a video corpus consisting of six German educational videos for students, produced by six presenters. Our analysis of 120 beat gestures (with a substantial variety of hand shapes) shows that beat gestures almost always align with prosodically prominent syllables, i.e., syllables carrying a pitch accent. Specifically, the stroke always starts either before or, more often, on a pitch-accented syllable; the apex mostly falls on the accented syllable (74%) but may also occur on subsequent syllables. The degree of prosodic prominence of the accented syllable (in terms of DIMA prominence levels) is predictive of the position of the apex, which occurs within rather than after the accented syllable more often for higher degrees of prominence. These findings provide new insights into the alignment of the prominence-lending features of prosody and gesture, thereby broadening the empirical landscape for beat gestures.
Creaky voice has been found to mark phrase-finality in many varieties of English, as well as in other languages. The present study investigates whether this is also true for Australian English (AusE), a variety that is understudied in creaky voice research. Automatic creak detection methods reduce the need for manual annotation of creak and allow us to analyse a large dataset of Australian teenagers' speech. As in other varieties, creak is found to be a marker of finality in AusE. Additionally, we find that males use higher rates of creaky voice than females, challenging the widely held assumption that creak is a feature of female speech.
Variation in the speech signal is a characteristic of spoken language, emerging partially as a result of interactions between various linguistic levels. One example of variation is phonetic reduction, where words are produced with missing or underspecified phonetic forms. Using a French conversational corpus, this paper focuses on the relationship between reduction and prosodic structure to determine whether certain positions favor the occurrence of reduction. We annotated and observed the distribution of reduced sequences within specific prosodic domains (Intonational and Accentual Phrases). Preliminary analyses revealed that the detected reductions occur mostly IP-medially and very rarely IP-finally. However, this pattern may vary among speakers, as speakers differ in the number of reductions produced and in their positions. It is also usually the case that reduced sequences occurring IP-medially coincide with AP-level boundaries, extending from one AP to another.
Nasal-cavity structure is stable in speech but varies across speakers, which potentially gives rise to speaker characteristics. Many studies have reported the acoustic contribution of the nasal cavity for nasal and nasalized sounds produced with an open velopharyngeal port. However, nasal-cavity resonance also emerges in non-nasal vowels through transvelar nasal coupling, which results in non-negligible modifications to non-nasal vowel spectra. In this study, nasal and oral output sounds were recorded separately during non-nasal utterances, and spectral analysis was conducted. The results indicate clear inter-speaker variability in two spectral measures below 2 kHz: the frequency location of the double-peaked first nasal-cavity resonance and the inconsistent distribution of minor dips above the first resonance. It was also observed that nostril outputs modulate oral output signals to lower the first formant frequency of naturally produced non-low vowels, an effect whose degree also varied across speakers.
Previous models for synthesizing speech from articulatory movements recorded with real-time MRI (rtMRI) predicted only vocal-tract shape parameters and required additional pitch information to generate a speech waveform. This study proposes a two-stage deep learning model composed of a CNN-BiLSTM that predicts a mel-spectrogram from an rtMRI video and a HiFi-GAN vocoder that synthesizes a speech waveform. We evaluated our model on two databases: the ATR 503 sentences rtMRI database and the USC-TIMIT database. The experimental results on the ATR 503 sentences rtMRI database show that the PESQ score and the RMSE of F0 are 1.64 and 26.7 Hz, respectively. This demonstrates that all acoustic parameters, including fundamental frequency, can be estimated from the rtMRI videos. In the experiment on the USC-TIMIT database, we obtained a good PESQ score and F0 RMSE. However, the synthesized speech is unclear, indicating that the quality of the datasets affects the intelligibility of the synthesized speech.
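A minimal PyTorch sketch of the first stage described above (rtMRI video to mel-spectrogram); the layer sizes, the 64x64 grayscale frame size, and the one-mel-frame-per-video-frame mapping are illustrative assumptions rather than the paper's configuration, and the HiFi-GAN vocoder stage is omitted.

import torch
import torch.nn as nn

class MRI2Mel(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        # Per-frame CNN encoder for (assumed) 64x64 grayscale rtMRI images.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch*frames, 64)
        )
        self.rnn = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_mels)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, 1, 64, 64) -> mel: (batch, frames, n_mels)
        b, t = video.shape[:2]
        feats = self.cnn(video.reshape(b * t, *video.shape[2:])).reshape(b, t, -1)
        seq, _ = self.rnn(feats)
        return self.out(seq)

mel = MRI2Mel()(torch.randn(2, 50, 1, 64, 64))  # -> torch.Size([2, 50, 80])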
Phonetic convergence describes the automatic and unconscious speech adaptation of two interlocutors in a conversation. This paper proposes a Siamese recurrent neural network (RNN) architecture to measure the convergence of the holistic spectral characteristics of speech sounds in an L2-L2 interaction. We extend an alternating reading task (the ART) dataset by adding 20 native Slovak L2 English speakers. We train and test the Siamese RNN model to measure phonetic convergence of L2 English speech from three different native language groups: Italian (9 dyads), French (10 dyads) and Slovak (10 dyads). Our results indicate that the Siamese RNN model effectively captures the dynamics of phonetic convergence and the speaker's imitation ability. Moreover, this text-independent model is scalable and capable of handling L1-induced speaker variability.
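A minimal PyTorch sketch of a Siamese recurrent encoder for comparing the spectral frames of two utterances, with cosine similarity as a convergence score; the GRU architecture, feature dimensionality, and pooling are illustrative assumptions, not the authors' exact model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseGRU(nn.Module):
    def __init__(self, n_feats: int = 40, hidden: int = 128):
        super().__init__()
        # The same (weight-shared) encoder processes both utterances in a pair.
        self.encoder = nn.GRU(n_feats, hidden, batch_first=True, bidirectional=True)

    def embed(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_feats); mean-pool the GRU outputs into one embedding.
        out, _ = self.encoder(x)
        return out.mean(dim=1)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Higher cosine similarity = spectrally closer renditions of the two turns.
        return F.cosine_similarity(self.embed(a), self.embed(b))

model = SiameseGRU()
sim = model(torch.randn(4, 200, 40), torch.randn(4, 200, 40))  # one score per pair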
This project aimed to explore the potential role of vision in speech contrast production and auditory perception development in children with cochlear implants (CWCI). Ten CWCI between 43 and 61 months of age, with at least 2 years of CI experience, served as participants. Using an auditory imitation task, we examined the children's ability to auditorily perceive contrasts that are more or less visible, both at baseline and one year after the initial assessment. The children's ability to produce these contrasts was also examined through a picture-naming task. The CWCI tended to produce features in both visibility conditions with greater accuracy than they perceived them, both at baseline and at 1 year. Production and perception accuracy increased after one year of CI use, with the mean perceptual gain for the more visible contrasts exceeding that for the less visible contrasts. The implications of the role of vision in contrast development are discussed.
Advanced end-to-end ASR systems encode speech signals by means of a multi-layer network architecture. In Wav2vec2.0, for example, a CNN is used as a feature encoder, on top of which transformer layers map the high-dimensional CNN representations to the elements of some lexicon. Compared to the previous generation of 'modular' ASR systems, it is much more difficult to interpret the processing and representations in an end-to-end system from a phonetic point of view. We built a Wav2vec2.0-based end-to-end system for producing broad phonetic transcriptions of Dutch. In this paper we investigate to what extent the CNN features and the representations on several transformer layers of a pre-trained and fine-tuned model reflect widely shared phonetic knowledge. For that purpose, we analyze distances between phones and the phonetic features of the most-activated phones in the output of an MLP classifier operating on the representations in several layers.
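A minimal sketch of how such layer-wise representations can be extracted and compared, assuming the Hugging Face transformers implementation and the English facebook/wav2vec2-base checkpoint rather than the authors' Dutch model; the audio and the phone frame intervals are placeholders.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)  # 1 s of placeholder 16 kHz audio stands in for real speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = model(inputs.input_values, output_hidden_states=True)

# out.hidden_states is a tuple: the feature-projection output followed by one entry per
# transformer layer, each of shape (batch, frames, 768) for the base model.
layer_reps = torch.stack(out.hidden_states)  # (n_layers + 1, 1, frames, 768)

# Given a frame-level phone alignment (hypothetical intervals here), per-phone vectors
# can be obtained by averaging frames, and their distances compared across layers.
phone_a = layer_reps[8, 0, 10:20].mean(dim=0)
phone_i = layer_reps[8, 0, 30:40].mean(dim=0)
print(torch.dist(phone_a, phone_i))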
Automatic Speaker Recognition (ASR) involves a complex range of processes to extract, model, and compare speaker-specific information from a pair of voice samples. Using heavily controlled recordings, this paper explores the impact of specific vocal conditions (i.e. vocal setting, disguise, accent guises) on ASR performance. When vocal conditions are matched, ASR performance is generally excellent (whisper is an exception). When conditions are mismatched, as in most forensic cases, we see an increase in discrimination and calibration error in some cases. The most problematic mismatches are those involving whisper and supralaryngeal vocal settings; these produce the greatest phonetic changes to speech. Mismatches involving high pitch also produce poor performance, although this appears to be driven by speaker-specific differences in articulatory implementation. We discuss the implications of the findings for the use of ASR in forensic casework and the interpretability of system output.
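As a worked illustration of combined discrimination and calibration error, the log-likelihood-ratio cost (Cllr) widely used in forensic speaker recognition evaluation can be computed from same-speaker and different-speaker log likelihood ratios; the scores below are hypothetical and not taken from the study.

import numpy as np

def cllr(same_speaker_llr: np.ndarray, diff_speaker_llr: np.ndarray) -> float:
    # LLRs are natural-log likelihood ratios; Cllr is expressed in bits (lower is better).
    penalty_same = np.mean(np.log2(1 + np.exp(-same_speaker_llr)))
    penalty_diff = np.mean(np.log2(1 + np.exp(diff_speaker_llr)))
    return 0.5 * (penalty_same + penalty_diff)

print(cllr(np.array([2.0, 3.5, 0.5]), np.array([-4.0, -1.0, -2.5])))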
Given the development of automatic speech recognition-based techniques for creating phonetic annotations of large speech corpora, there has been a growing interest in investigating the frequencies of occurrence of phonological and reduction processes. Because most studies have analyzed these processes separately, they did not provide insights into their co-occurrences. This paper contributes by introducing graph-theoretic methods for the analysis of pronunciation variation in a large corpus of Austrian German conversational speech. More specifically, we investigate how reduction processes that are typical for spontaneous German in general co-occur with phonological processes typical for the Austrian German variety. While our concrete findings are of special interest to scientists investigating variation in German, the approach presented opens new possibilities for analyzing pronunciation variation in large corpora of different speaking styles in any language.
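A minimal sketch of the kind of co-occurrence graph this implies, with hypothetical process labels standing in for the Austrian German annotations: each process is a node, and an edge weight counts how often two processes co-occur in the same word token.

import itertools
import networkx as nx

# Hypothetical per-token annotations of the processes observed in each word token.
tokens = [
    {"schwa_deletion", "final_t_deletion"},
    {"schwa_deletion", "l_vocalisation"},
    {"final_t_deletion", "l_vocalisation", "schwa_deletion"},
]

G = nx.Graph()
for procs in tokens:
    for a, b in itertools.combinations(sorted(procs), 2):
        weight = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)

# Edge weights now give co-occurrence counts; standard graph measures
# (weighted degree, centrality, community detection) can then be applied.
print(sorted(G.edges(data="weight")))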
Validation of forensic voice comparison methods requires testing with speech samples that are representative of forensic casework conditions. Increasingly, around the world, forensic voice comparison casework is being undertaken using automatic speaker recognition (ASR) systems. However, multilingualism remains a key issue in applying automatic systems to forensic casework. This research considers the effect of language on ASR performance, testing developers' claims of 'language independency'. Specifically, we examine the extent to which language mismatch, either between the known and questioned samples or between the evidential samples and the calibration data, affects overall system performance and the resulting strength of evidence (i.e., likelihood ratios for individual comparisons). Results indicate that mixed-language trials produce more errors than single-language trials, which makes drawing evidential conclusions based on bilingual data challenging.
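A minimal sketch of logistic-regression score calibration, the common step by which raw comparison scores are mapped to interpretable log likelihood ratios; the calibration scores, labels, and evidential score below are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Hypothetical calibration data: 200 same-speaker and 200 different-speaker scores.
cal_scores = np.concatenate([rng.normal(2, 1, 200), rng.normal(-1, 1, 200)])[:, None]
cal_labels = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = same speaker

calib = LogisticRegression().fit(cal_scores, cal_labels)
w, b = calib.coef_[0, 0], calib.intercept_[0]

# For an evidential score s, the calibrated natural-log LR is w*s + b minus the log
# prior odds of the calibration set (balanced here, so the prior log odds are 0).
print(w * 1.3 + b)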
Individuals with impaired speech are often understood only by those familiar with their speech, e.g., a caregiver or close family member. These impaired speakers are therefore highly dependent upon those familiar listeners for their spoken communication needs, which range from basic expressions of hunger or thirst to much more advanced requirements such as being understood at a work meeting. A significant subset of individuals with impaired speech also have reduced motor function, which limits their mobility or dexterity. For this subset of individuals, the ability to communicate via the medium of speech is crucial. This paper describes a personalised speech communication application targeted towards English-language speakers with impaired speech. The application enables the user to hold conversations with other humans, dictate text to a machine, and participate in meetings via closed captioning.
This paper proposes a system capable of recognizing a speaker's utterance-level emotion through multimodal cues in a video. The system seamlessly integrates multiple AI models to first extract and pre-process multimodal information from the raw video input. Next, an end-to-end multimodal emotion recognition (MER) model sequentially predicts the speaker's emotions at the utterance level. Additionally, users can interactively explore the system through the implemented interface.
This paper proposes a mental health spoken dialogue system intended to relieve mental distress among university students by acting as an active listener that promotes self-disclosure. The proposed system is designed for Mandarin with the accent and lexicon specific to Taiwan, one of the underrepresented spoken language varieties. To achieve this objective, this work considers three key factors: high-quality speech components, including automatic speech recognition and text-to-speech models; personalized responses that maintain trustworthiness; and seamless integration among the dialogue system components.
We propose a system designed for multimodal emotion recognition. Our research focuses on showing the impact of various signals in the emotion recognition process. Apart from reporting the average results of our models, we would like to encourage individual engagement of conference participants and explore how a unique emotional scene recorded on the spot can be interpreted by the models, for individual modalities as well as their combinations. Our models work for English, German, and Korean. We show a comparison of emotion recognition accuracy for these three languages, including the influence of each modality. Our second experiment explores emotion recognition for people wearing face masks. We show that the use of face masks affects not only the video signal but also audio and text. To our knowledge, no other study shows the effects of wearing a mask across three modalities. Unlike other studies where masks are added artificially, we use real recordings of actors in masks.
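A minimal sketch of simple late fusion across modalities, one straightforward way to report accuracy for individual modalities as well as their combinations; the emotion set and per-modality posteriors are hypothetical, not the demonstrated system's outputs.

import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]

# Hypothetical per-modality posteriors for a single utterance.
p_audio = np.array([0.10, 0.55, 0.25, 0.10])
p_video = np.array([0.05, 0.40, 0.45, 0.10])
p_text = np.array([0.15, 0.60, 0.15, 0.10])

def fuse(*posteriors):
    # Average the per-modality probabilities (unweighted late fusion).
    return np.mean(posteriors, axis=0)

print(EMOTIONS[int(np.argmax(fuse(p_audio, p_video, p_text)))])  # fused prediction
print(EMOTIONS[int(np.argmax(p_video))])                         # video-only prediction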
Tools to automate formant measurement in vowels have been developed recently, but they have not been tested on pediatric speech samples. Critically, child speech presents unique acoustic challenges, including high fundamental frequencies, wide formant bandwidths, more variable formant values, and increased subglottal coupling relative to adult speech. More importantly, these tools have not been tested on the diverse linguistic varieties spoken by children. This study compares three tools for automatic formant estimation: Voweltine, Fast Track, and SpeechMark. The tools are tested on vowel productions from a young child with a speech sound disorder from a Black-identifying family. Benefits and tradeoffs of each automation tool are discussed.
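A minimal sketch of the measurement task being evaluated, using praat-parselmouth as a generic Praat-based baseline rather than any of the three tools compared above: F1 and F2 are taken at a vowel midpoint with a raised formant ceiling appropriate for child speech. The file path and vowel interval are hypothetical.

import parselmouth

sound = parselmouth.Sound("child_vowel.wav")  # hypothetical recording
formants = sound.to_formant_burg(maximum_formant=8000)  # higher ceiling for child speech

vowel_start, vowel_end = 0.12, 0.31  # hypothetical vowel interval (seconds)
midpoint = (vowel_start + vowel_end) / 2
f1 = formants.get_value_at_time(1, midpoint)
f2 = formants.get_value_at_time(2, midpoint)
print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz")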
Ecological Momentary Assessment (EMA) is a valuable research method for evaluating the real-world performance of novel computational algorithms and device technologies, addressing the shortcomings of objective metrics and laboratory assessments. Our customisable, cloud-connected smartphone app, NEMA, gathers repeated self-reports and related acoustic features in users' natural environments, providing personalised insights into how specific technologies impact daily activities. NEMA has proven effective in assessing the real-world performance of novel hearing aid algorithms and features, while also improving our understanding of the challenges faced by those with hearing loss, which in turn drives new developments. This paper outlines NEMA's innovative features designed to facilitate efficient data collection and presents findings from a recent clinical trial in which NEMA played a key role in providing real-world evidence of user benefits for a medical device seeking FDA approval.
We present a unified multimodal dialog platform for the remote assessment and monitoring of patients' neurological and mental health. Tina, a virtual agent, guides participants through an immersive interaction wherein objective speech, facial, linguistic and cognitive biomarkers can be automatically computed from participant speech and video in near real time. Furthermore, Tina encourages participants to describe, in their own words, their most bothersome problems and what makes them better or worse, through the Patient Report of Problems (PROP) instrument. The PROP captures unfiltered verbatim replies of patients, in contrast with traditional patient reported outcomes that typically rely on categorical assessments. We argue that combining these patient reports (i.e., what they say) with objective biomarkers (i.e., how they say it and what they do) can greatly enhance the quality of telemedicine and improve the efficacy of siteless trials and digital therapeutic interventions.
Stuttering is a prevalent speech disorder that affects millions of people worldwide. In this Show and Tell presentation, we demonstrate a novel platform that takes speech samples in English and Kannada to detect and analyze stuttering in patients. The user-friendly interface collects demographic details and speech samples and generates comprehensive reports for different stuttering disfluencies. The platform supports four different user types, providing read-only access for admins and full write access for super admins. Our platform provides valuable assistance to speech-language pathologists in evaluating speech samples. The proposed platform supports both live and recorded speech samples and presents a flexible approach to stuttering detection and analysis. Our research demonstrates the potential of technology to improve speech-language pathology for stuttering. The F-score was used as the metric for evaluating the models on the stutter detection task.
We propose an automated pipeline for robustly identifying neurological disorders from interactive therapeutic exercises gathered via the mobile therapy app myReha. The app captures speech and cognitive parameters from over 30,000 tasks in various scenarios. Users receive immediate and highly accurate feedback on pronunciation and coherence for language tasks, while voice recordings are fed to a feature extraction pipeline in the backend. These features are then used to construct speech characteristics that are highly indicative of different neurological disorders, such as acquired aphasia after stroke. The data are visually presented in the web application nyra.insights, which allows medical professionals to quickly derive treatment recommendations and closely monitor outcomes. During the Show and Tell session, users can experiment with the interactive myReha app and experience the real-time speech analysis capabilities via the nyra.insights web platform.
ANNA is a telephony-based cognitive assessment tool designed to aid nurses in caring for patients who require close monitoring for the development of confusion or neurological impairment. Of particular concern is the treatment of Immune Effector Cell-Associated Neurotoxicity Syndrome (ICANS), a condition which occurs quite frequently as an adverse outcome of Chimeric Antigen Receptor-T (CAR-T) cancer immunotherapy. ANNA employs both traditional verbal tests for cognitive impairment and novel linguistic methods which identify abnormalities in the patient's speech during ordinary conversation. To collect ordinary speech, it uses a lightweight instance of Facebook's large language model BlenderBot to engage the patient in a partially unscripted conversation. ANNA is designed for easy deployment by healthcare providers, being sufficiently lightweight to run on consumer-grade hardware and needing access only to a patient's phone number to interact with them.
Over twenty percent of the world's population suffers from some form of hearing loss, making it one of the most significant public health challenges. Current hearing aids commonly amplify noise while failing to improve speech comprehension in crowded social settings. In this demonstration, we showcase a proof-of-concept implementation of the world's first 5G and Internet of Things (IoT) enabled multi-modal hearing aid (MM HA) prototype. It integrates an innovative 5G cloud radio access network (C-RAN) and IoT-based transceiver model for real-time audio-visual speech enhancement (AVSE). Specifically, we demonstrate a transceiver model for cloud-based AVSE which satisfies the high data rate and low latency requirements of future MM HAs. The innovative 5G-IoT transceiver application is shown to satisfy HA latency limitations while transmitting raw noisy AV data from an MM HA prototype device to the cloud for deep learning-based real-time AVSE processing and obtaining a clean audio signal.