INTERSPEECH.2023 - Others

Total: 240

#1 Chinese EFL Learners’ Perception of English Prosodic Focus

Authors: Xinya Zhang ; Ying Chen

Focus in a sentence can be realized prosodically in speech communication, and it has been found difficult for L2 learners to acquire. The present study examines Chinese learners' perception of English prosodic focus, specifically the effects of learners' English proficiency, intonation type, sentence length, and focus location on the perceptual accuracy of English prosodic focus by Chinese EFL learners. Results of two trials in the perception experiment reveal that focus location, intonation type, and English proficiency significantly impacted Chinese learners' perceptual accuracy for both single focus and dual focus in English. Focus in statements was perceived more accurately than focus in questions for both single and dual focus. Focus located on sentence-final words in questions was perceived more accurately than focus on non-final words in questions. Learners' English proficiency correlated positively with the accuracy of focus perception, especially for dual focus.

#2 Pitch Accent Variation and the Interpretation of Rising and Falling Intonation in American English

Authors: Thomas Sostarics ; Jennifer Cole

This study tests the division of labor in the meaning conveyed by pitch accents and edge tones in English intonation. In three perception studies, we investigate where the locus of the contrast between an assertive and an inquisitive interpretation resides. In doing so, we also gain insight into the role of potentially meaningful within- and between-category variation in the phonetic implementation of discrete intonational tunes. We find that the pitch accent does not contribute to assertive interpretation. Rather, the distinction between assertive and inquisitive interpretation is cued primarily by the final F0 of the pitch contour regardless of the pitch accent, although increased overall pitch prominence may trigger a salient focus interpretation that interferes with judging assertiveness.

#3 Tonal coarticulation as a cue for upcoming prosodic boundary

Authors: Jianjing Kuang ; May Pik Yu Chan ; Nari Rhee

It is well established that the absence of tonal coarticulation, or pitch reset, is a salient cue to the beginning of a large prosodic domain; however, it remains unclear whether tonal coarticulation can be an informative cue to the end of a prosodic domain. We examined this question with two continuous speech corpora of Mandarin, using both expert and crowd-sourced perceptual annotations. An FPCA model of the holistic tonal contours shows that the carry-over effect of the preceding tone is significantly affected by the strength of the following boundary: stronger carry-over effects are associated with the end of larger prosodic boundaries. Moreover, machine learning classification shows that fine-grained tonal coarticulation patterns are salient cues for predicting larger prosodic boundaries. This result is further validated by crowd-sourced boundary perception ratings from human listeners. The study has important implications for the understanding of prosodic phrasing.
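
For readers who want to try the general approach, the sketch below shows a discretized FPCA of time-normalized f0 contours followed by a classifier that predicts boundary strength from the contour scores. It is only a minimal illustration with invented data and assumed shapes, not the authors' pipeline or corpora.

```python
# Minimal sketch (not the authors' pipeline): discretized FPCA of
# time-normalized f0 contours plus a classifier predicting boundary strength.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: 500 syllable-sized f0 contours, each resampled to
# 30 time points (speaker-normalized semitones), and a label for the
# following boundary (0 = none, 1 = minor, 2 = major).
contours = rng.normal(size=(500, 30))
boundary = rng.integers(0, 3, size=500)

# PCA on the resampled contours approximates the leading functional
# principal components of the f0 trajectories.
fpca = PCA(n_components=4)
scores = fpca.fit_transform(contours)          # per-token FPC scores
print("variance explained:", fpca.explained_variance_ratio_)

# Classify upcoming boundary strength from the tonal-shape scores.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(clf, scores, boundary, cv=5).mean()
print(f"cross-validated accuracy: {acc:.2f}")
```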

#4 Alignment of Beat Gestures and Prosodic Prominence in German

Authors: Sophie Repp ; Lara Muhtz ; Johannes Heim

We present evidence on the alignment of beat gestures and prosodic prominence from a video corpus of six German educational videos for students, produced by six presenters. Our analysis of 120 beat gestures (with a substantial variety of hand shapes) shows that beat gestures almost always align with prosodically prominent syllables, i.e., syllables carrying a pitch accent. Specifically, the stroke always starts before, or more often on, a pitch-accented syllable; the apex mostly falls on the accented syllable (74%) but may also occur on subsequent syllables. The degree of prosodic prominence of the accented syllable (in terms of DIMA prominence levels) is predictive of the position of the apex, which occurs within rather than after the accented syllable more often for higher degrees of prominence. These findings provide new insights into the alignment of prominence-lending features of prosody and gesture, thereby broadening the empirical landscape for beat gestures.

#5 Creak Prevalence and Prosodic Context in Australian English

Authors: Hannah White ; Joshua Penney ; Andy Gibson ; Anita Szakay ; Felicity Cox

Creaky voice has been found to mark phrase finality in many varieties of English, as well as in other languages. The present study investigates whether this is also true of Australian English (AusE), a variety that is understudied in creaky voice research. Automatic creak detection methods reduce the need for manual annotation of creak and allow us to analyse a large dataset of Australian teenagers' speech. As in other varieties, creak is found to be a marker of finality in AusE. Additionally, we find that males use higher rates of creaky voice than females, challenging the widely held assumption that creak is a feature of female speech.

#6 Speech reduction: position within French prosodic structure

Authors: Kübra Bodur ; Roxane Bertrand ; James S. German ; Stéphane Rauzy ; Corinne Fredouille ; Christine Meunier

Variation in the speech signal is a characteristic of spoken language, emerging partially as a result of interactions between various linguistic levels. One example of such variation is phonetic reduction, where words are produced with missing or underspecified phonetic forms. Using a French conversational corpus, this paper focuses on the relationship between reduction and prosodic structure to determine whether certain positions favor the occurrence of reduction. We annotated and observed the distribution of reduced sequences within specific prosodic domains (Intonational and Accentual Phrases). Preliminary analyses revealed that the detected reductions occur mostly mid-IP and very rarely in IP-final position. However, this pattern may vary among speakers, who differ both in the number of reductions produced and in their positions. Reduced sequences occurring mid-IP also usually coincide with AP-level boundaries, extending from one AP to another.

#7 Transvelar Nasal Coupling Contributing to Speaker Characteristics in Non-nasal Vowels

Authors: Ziyu Zhu ; Yujie Chi ; Zhao Zhang ; Kiyoshi Honda ; Jianguo Wei

Nasal-cavity structure is stable during speech yet varies across speakers, which potentially gives rise to speaker characteristics. Many studies have reported the acoustic contribution of the nasal cavity to nasal and nasalized sounds produced with an open velopharyngeal port. However, nasal-cavity resonance also emerges in non-nasal vowels through transvelar nasal coupling, which results in non-negligible modifications to non-nasal vowel spectra. In this study, nasal and oral output sounds were recorded separately during non-nasal utterances, and spectral analysis was conducted. The results indicate clear inter-speaker variability in two spectral measures below 2 kHz: the frequency location of the double-peaked first nasal-cavity resonance and the inconsistent distribution of minor dips above the first resonance. It was also observed that nostril outputs modulate oral output signals to lower the first formant frequency of naturally produced non-low vowels, an effect whose degree also varied across speakers.

#8 Speech Synthesis from Articulatory Movements Recorded by Real-time MRI

Authors: Yuto Otani ; Shun Sawada ; Hidefumi Ohmura ; Kouichi Katsurada

Previous speech synthesis models from articulatory movements recorded using real-time MRI (rtMRI) only predicted vocal tract shape parameters and required additional pitch information to generate a speech waveform. This study proposes a two-stage deep learning model composed of a CNN-BiLSTM that predicts a mel-spectrogram from an rtMRI video and a HiFi-GAN vocoder that synthesizes a speech waveform. We evaluated our model on two databases: the ATR 503 sentences rtMRI database and the USC-TIMIT database. The experimental results on the ATR 503 sentences rtMRI database show that the PESQ score and the RMSE of F0 are 1.64 and 26.7 Hz, respectively. This demonstrates that all acoustic parameters, including the fundamental frequency, can be estimated from the rtMRI videos. In the experiment on the USC-TIMIT database, we obtained a good PESQ score and F0 RMSE. However, the synthesized speech is unclear, indicating that the quality of the datasets affects the intelligibility of the synthesized speech.
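
To make the first stage concrete, the sketch below shows a generic CNN-plus-BiLSTM mapping from a sequence of rtMRI frames to mel-spectrogram frames. All shapes are assumptions, the vocoder stage is omitted, and in practice the video and mel frame rates differ and would need alignment or upsampling; this is not the authors' exact architecture.

```python
# Minimal sketch (assumed shapes, not the paper's exact model): a CNN encodes
# each rtMRI frame, a BiLSTM maps the frame sequence to mel-spectrogram frames;
# a neural vocoder such as HiFi-GAN would then generate the waveform.
import torch
import torch.nn as nn

class RtMRIToMel(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        # Per-frame CNN encoder for single-channel rtMRI images.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),   # -> 32*4*4 = 512
        )
        self.blstm = nn.LSTM(512, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, video):                  # video: (B, T, 1, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))  # (B*T, 512)
        feats = feats.view(b, t, -1)
        out, _ = self.blstm(feats)
        return self.proj(out)                  # (B, T, n_mels)

model = RtMRIToMel()
dummy = torch.randn(2, 50, 1, 64, 64)          # 2 clips, 50 frames, 64x64 px
print(model(dummy).shape)                      # torch.Size([2, 50, 80])
```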

#9 The ART of Conversation: Measuring Phonetic Convergence and Deliberate Imitation in L2-Speech with a Siamese RNN

Authors: Zheng Yuan ; Aldo Pastore ; Dorina de Jong ; Hao Xu ; Luciano Fadiga ; Alessandro D'Ausilio

Phonetic convergence describes the automatic and unconscious speech adaptation of two interlocutors in a conversation. This paper proposes a Siamese recurrent neural network (RNN) architecture to measure the convergence of the holistic spectral characteristics of speech sounds in an L2-L2 interaction. We extend an alternating reading task (the ART) dataset by adding 20 native Slovak L2 English speakers. We train and test the Siamese RNN model to measure phonetic convergence of L2 English speech from three different native language groups: Italian (9 dyads), French (10 dyads) and Slovak (10 dyads). Our results indicate that the Siamese RNN model effectively captures the dynamics of phonetic convergence and the speaker's imitation ability. Moreover, this text-independent model is scalable and capable of handling L1-induced speaker variability.
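
The sketch below illustrates the Siamese idea in its simplest form: a shared recurrent encoder embeds two utterances, and the similarity of the embeddings serves as a convergence score. The input features, cosine similarity, and shapes are assumptions for illustration, not the authors' model or training objective.

```python
# Minimal sketch (assumed filterbank/MFCC inputs and cosine similarity):
# a Siamese recurrent encoder with shared weights embeds two utterances,
# and the embedding similarity acts as a convergence score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseRNN(nn.Module):
    def __init__(self, n_feats=40, hidden=128, emb=64):
        super().__init__()
        self.rnn = nn.GRU(n_feats, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, emb)

    def embed(self, x):                        # x: (B, T, n_feats)
        _, h = self.rnn(x)                     # h: (2, B, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)    # concatenate directions
        return F.normalize(self.head(h), dim=-1)

    def forward(self, a, b):
        # Higher similarity -> spectrally closer productions.
        return F.cosine_similarity(self.embed(a), self.embed(b), dim=-1)

model = SiameseRNN()
utt_a = torch.randn(4, 200, 40)   # speaker A, 4 utterances
utt_b = torch.randn(4, 180, 40)   # speaker B, matched turns
print(model(utt_a, utt_b))        # one convergence score per pair
```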

#10 Did you see that? Exploring the role of vision in the development of consonant feature contrasts in children with cochlear implants

Authors: James Mahshie ; Michael Larsen

This project aimed to explore the potential role of vision in the development of speech contrast production and auditory perception in children with cochlear implants (CWCI). Ten CWCI between 43 and 61 months of age, each with at least two years of CI experience, served as participants. The children's ability to auditorily perceive contrasts that are more or less visible was examined with an auditory imitation task, both at baseline and one year after the initial assessment. Their ability to produce these contrasts was also examined through a picture-naming task. The CWCI tended to produce features in both visibility conditions with greater accuracy than they perceived them, both at baseline and at one year. Production and perception accuracy increased after one year of CI use, with the mean perceptual gain for the more visible contrasts exceeding that of the less visible contrasts. The implications of the role of vision in contrast development are discussed.

#11 Phonemic competition in end-to-end ASR models

Authors: Louis ten Bosch ; Martijn Bentum ; Lou Boves

Advanced end-to-end ASR systems encode speech signals by means of a multi-layer network architecture. In Wav2vec2.0, for example, a CNN is used as a feature encoder, on top of which transformer layers map the high-dimensional CNN representations to the elements of a lexicon. Compared to the previous generation of 'modular' ASR systems, it is much more difficult to interpret the processing and representations in an end-to-end system from a phonetic point of view. We built a Wav2vec2.0-based end-to-end system for producing broad phonetic transcriptions of Dutch. In this paper we investigate to what extent the CNN features and the representations in several transformer layers of a pre-trained and fine-tuned model reflect widely shared phonetic knowledge. For that purpose, we analyze distances between phones and the phonetic features of the most-activated phones in the output of an MLP classifier operating on the representations in several layers.
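
As a rough illustration of this kind of layer-wise probing, the sketch below extracts frame-level representations from every transformer layer of a generic pre-trained wav2vec 2.0 checkpoint and compares averaged vectors for two (hypothetical) phone spans. The checkpoint, the random audio, and the frame indices are placeholders, not the authors' Dutch model, classifier, or alignments.

```python
# Minimal sketch (a generic probe, not the paper's Dutch model or MLP
# classifier): per-layer hidden states from wav2vec 2.0 and a distance
# between mean representations of two hypothetical phone spans.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

wav = torch.randn(16000)                      # 1 s of (here random) 16 kHz audio
inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: tuple of (1 + num_layers) tensors, each (1, T, 768);
# each frame covers roughly 20 ms of audio.
for layer, h in enumerate(out.hidden_states):
    frames = h[0]                             # (T, 768)
    # Hypothetical frame spans for two phones, e.g. from a forced alignment.
    phone_a, phone_b = frames[10:20].mean(0), frames[30:40].mean(0)
    print(f"layer {layer:2d}: phone distance = {torch.dist(phone_a, phone_b):.2f}")
```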

#12 Automatic speaker recognition with variation across vocal conditions: a controlled experiment with implications for forensics

Authors: Vincent Hughes ; Jessica Wormald ; Paul Foulkes ; Philip Harrison ; Finnian Kelly ; David van der Vloed ; Poppy Welch ; Chenzi Xu

Automatic Speaker Recognition (ASR) involves a complex range of processes to extract, model, and compare speaker-specific information from a pair of voice samples. Using heavily controlled recordings, this paper explores the impact of specific vocal conditions (i.e. vocal setting, disguise, accent guises) on ASR performance. When vocal conditions are matched, ASR performance is generally excellent (whisper is an exception). When conditions are mismatched, as in most forensic cases, we see an increase in discrimination and calibration error in some cases. The most problematic mismatches are those involving whisper and supralaryngeal vocal settings; these produce the greatest phonetic changes to speech. Mismatches involving high pitch also produce poor performance, although this appears to be driven by speaker-specific differences in articulatory implementation. We discuss the implications of the findings for the use of ASR in forensic casework and the interpretability of system output.
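
As background on the error measures mentioned above, the sketch below computes the equal error rate (discrimination) and Cllr (a calibration-sensitive cost) from invented same-speaker and different-speaker log-likelihood-ratio scores. It is a generic worked example, not the systems, scores, or conditions tested in the paper.

```python
# Minimal sketch with invented scores: EER and Cllr from same-speaker (ss)
# and different-speaker (ds) log10 likelihood-ratio scores.
import numpy as np

rng = np.random.default_rng(1)
llr_ss = rng.normal(2.0, 1.5, 200)     # same-speaker comparisons (log10 LR)
llr_ds = rng.normal(-2.0, 1.5, 2000)   # different-speaker comparisons

def cllr(llr_ss, llr_ds):
    # Cllr = 0.5 * [mean(log2(1 + 1/LR_ss)) + mean(log2(1 + LR_ds))]
    lr_ss, lr_ds = 10.0 ** llr_ss, 10.0 ** llr_ds
    return 0.5 * (np.mean(np.log2(1 + 1 / lr_ss)) +
                  np.mean(np.log2(1 + lr_ds)))

def eer(llr_ss, llr_ds):
    # Sweep thresholds; the EER is where miss and false-alarm rates cross.
    thresholds = np.sort(np.concatenate([llr_ss, llr_ds]))
    miss = np.array([(llr_ss < t).mean() for t in thresholds])
    fa = np.array([(llr_ds >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - fa))
    return (miss[i] + fa[i]) / 2

print(f"Cllr = {cllr(llr_ss, llr_ds):.3f}, EER = {eer(llr_ss, llr_ds):.1%}")
```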

#13 Exploring Graph Theory Methods For the Analysis of Pronunciation Variation in Spontaneous Speech

Authors: Bernhard C. Geiger ; Barbara Schuppler

With the development of automatic speech recognition based techniques for creating phonetic annotations of large speech corpora, there has been growing interest in investigating the frequencies of occurrence of phonological and reduction processes. Because most studies have analyzed these processes separately, however, they have provided little insight into their co-occurrence. This paper introduces graph theory methods for the analysis of pronunciation variation in a large corpus of Austrian German conversational speech. More specifically, we investigate how reduction processes that are typical for spontaneous German in general co-occur with phonological processes typical for the Austrian German variety. While our concrete findings are of special interest to scientists investigating variation in German, the approach presented opens new possibilities for analyzing pronunciation variation in large corpora of different speaking styles in any language.
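
A very small example of the underlying representation: treat each pronunciation process as a node and each within-word co-occurrence count as a weighted edge, then read off simple graph statistics. The process names and counts below are invented for illustration, not the corpus data or the specific graph measures used in the paper.

```python
# Minimal sketch (toy counts, not the Austrian German corpus): a weighted
# co-occurrence graph of pronunciation processes.
import networkx as nx

# Hypothetical within-word co-occurrence counts of reduction and
# variety-specific phonological processes.
cooccurrence = {
    ("schwa_deletion", "final_t_deletion"): 120,
    ("schwa_deletion", "l_vocalization"): 45,
    ("final_t_deletion", "s_palatalization"): 30,
    ("l_vocalization", "s_palatalization"): 12,
}

G = nx.Graph()
for (a, b), count in cooccurrence.items():
    G.add_edge(a, b, weight=count)

# Weighted degree: how often a process co-occurs with any other process.
print(dict(G.degree(weight="weight")))
# Unweighted betweenness hints at processes linking otherwise separate clusters.
print(nx.betweenness_centrality(G))
```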

#14 Automatic Speaker Recognition performance with matched and mismatched female bilingual speech data

Authors: Bryony Nuttall ; Philip Harrison ; Vincent Hughes

Validation of forensic voice comparison methods requires testing with speech samples that are representative of forensic casework conditions. Increasingly, around the world, forensic voice comparison casework is being undertaken using automatic speaker recognition (ASR) systems. However, multilingualism remains a key issue in applying automatic systems to forensic casework. This research considers the effect of language on ASR performance, testing developers' claims of 'language independence'. Specifically, we examine the extent to which a language mismatch, either between the known and questioned samples or between the evidential samples and the calibration data, affects overall system performance and the resulting strength of evidence (i.e., likelihood ratios for individual comparisons). Results indicate that mixed-language trials produce more errors than single-language trials, which makes drawing evidential conclusions based on bilingual data challenging.

#15 A Personalised Speech Communication Application for Dysarthric Speakers

Authors: Matthew Gibson ; Ievgen Karaulov ; Oleksii Zhelo ; Filip Jurcicek

Individuals with impaired speech are often understood only by those familiar with their speech, e.g., a caregiver or close family member. These impaired speakers are therefore highly dependent upon those familiar listeners for their spoken communication needs. These needs range from basic expressions of hunger or thirst to much more advanced requirements, such as being understood at a work meeting. A significant subset of individuals with impaired speech also have reduced motor function, which limits their mobility or dexterity. For this subset of individuals, the ability to communicate via the medium of speech is crucial. This paper describes a personalised speech communication application targeted at English language speakers with impaired speech. The application enables the user to hold conversations with other humans, dictate text to a machine, and participate in meetings via closed captioning.

#16 Video Multimodal Emotion Recognition System for Real World Applications

Authors: Sun-Kyung Lee ; Jong-Hwan Kim

This paper proposes a system capable of recognizing a speaker's utterance-level emotion from multimodal cues in a video. The system seamlessly integrates multiple AI models to first extract and pre-process multimodal information from the raw video input. Next, an end-to-end MER model sequentially predicts the speaker's emotions at the utterance level. Additionally, users can try the system interactively through the implemented interface.

#17 Promoting Mental Self-Disclosure in a Spoken Dialogue System

Authors: Mahdin Rohmatillah ; Bobbi Aditya ; Li-Jen Yang ; Bryan Gautama Ngo ; Willianto Sulaiman ; Jen-Tzung Chien

This paper proposes a mental health spoken dialogue system that helps relieve mental distress among university students by acting as an active listener that promotes self-disclosure. The proposed system is designed for Mandarin with the specific accent and lexicon of Taiwan, which remains one of the underrepresented spoken language varieties. To achieve this objective, the work addresses three key factors: high-quality speech components, including automatic speech recognition and text-to-speech models; personalized responses; and trustworthiness with seamless integration among dialogue system components.

#18 "Select language, modality or put on a mask!" Experiments with Multimodal Emotion Recognition [PDF] [Copy] [Kimi1]

Authors: Paweł Bujnowski ; Bartłomiej Kuźma ; Bartłomiej Paziewski ; Jacek Rutkowski ; Joanna Marhula ; Zuzanna Bordzicka ; Piotr Andruszkiewicz

We propose a system designed for multimodal emotion recognition. Our research focuses on showing the impact of various signals in the emotion recognition process. Apart from reporting the average results of our models, we would like to encourage individual engagement from conference participants and explore how a unique emotional scene recorded on the spot is interpreted by the models, for individual modalities as well as their combinations. Our models work for English, German and Korean. We compare emotion recognition accuracy for these three languages, including the influence of each modality. Our second experiment explores emotion recognition for people wearing face masks. We show that the use of face masks affects not only the video signal but also audio and text. To our knowledge, no other study shows the effects of wearing a mask across three modalities. Unlike other studies where masks are added artificially, we use real recordings of actors in masks.

#19 My Vowels Matter: Formant Automation Tools for Diverse Child Speech

Authors: Hannah Valentine ; Joel MacAuslan ; Maria Grigos ; Marisha Speights

Tools to automate formant measurement in vowels have been developed recently, but they have not been tested on pediatric speech samples. Critically, child speech poses unique acoustic challenges, including high fundamental frequencies, wide formant bandwidths, more variable formant values, and increased subglottal coupling relative to adult speech. More importantly, these tools have not been tested on the diverse linguistic varieties spoken by children. This study compares three tools for automatic formant estimation: Voweltine, Fast Track, and SpeechMark. The tools are tested on vowel productions from a young child with a speech sound disorder from a Black-identifying family. Benefits and tradeoffs of each automation tool are discussed.
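
For orientation, the sketch below shows a plain Praat-based formant measurement via parselmouth, the kind of baseline the named tools build on; raising the formant ceiling is one of the settings that matters for child speech. It is not Voweltine, Fast Track, or SpeechMark, and the file path is a hypothetical placeholder.

```python
# Minimal sketch (generic Praat-based baseline, not any of the three tools
# named above): F1/F2 at the vowel midpoint with a raised formant ceiling,
# as is common for short child vocal tracts. The file path is invented.
import parselmouth

snd = parselmouth.Sound("child_vowel.wav")     # hypothetical recording
formants = snd.to_formant_burg(
    time_step=0.01,
    max_number_of_formants=5,
    maximum_formant=8000,                      # higher ceiling for child speech
)

midpoint = snd.duration / 2
f1 = formants.get_value_at_time(1, midpoint)
f2 = formants.get_value_at_time(2, midpoint)
print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz at t = {midpoint:.3f} s")
```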

#20 NEMA: An Ecologically Valid Tool for Assessing Hearing Devices, Advanced Algorithms, and Communication in Diverse Listening Environments

Authors: Nicky Chong-White ; Arun Sebastian ; Jorge Mejia

Ecological Momentary Assessment (EMA) is a valuable research method for evaluating the real-world performance of novel computational algorithms and device technologies, addressing the shortcomings of objective metrics and laboratory assessments. Our customisable, cloud-connected smartphone app, NEMA, gathers repeated self-reports and related acoustic features in users' natural environments, providing personalised insights into how specific technologies impact daily activities. NEMA has proven effective in assessing the real-world performance of novel hearing aid algorithms and features, while also improving our understanding of the challenges faced by those with hearing loss, which in turn drives new developments. This paper outlines NEMA's innovative features designed to facilitate efficient data collection and presents findings from a recent clinical trial in which NEMA played a key role in providing real-world evidence of user benefits for a medical device seeking FDA approval.

#21 When Words Speak Just as Loudly as Actions: Virtual Agent Based Remote Health Assessment Integrating What Patients Say with What They Do

Authors: Vikram Ramanarayanan ; David Pautler ; Lakshmi Arbatti ; Abhishek Hosamath ; Michael Neumann ; Hardik Kothare ; Oliver Roesler ; Jackson Liscombe ; Andrew Cornish ; Doug Habberstad ; Vanessa Richter ; David Fox ; David Suendermann-Oeft ; Ira Shoulson

We present a unified multimodal dialog platform for the remote assessment and monitoring of patients' neurological and mental health. Tina, a virtual agent, guides participants through an immersive interaction wherein objective speech, facial, linguistic and cognitive biomarkers can be automatically computed from participant speech and video in near real time. Furthermore, Tina encourages participants to describe, in their own words, their most bothersome problems and what makes them better or worse, through the Patient Report of Problems (PROP) instrument. The PROP captures unfiltered verbatim replies of patients, in contrast with traditional patient reported outcomes that typically rely on categorical assessments. We argue that combining these patient reports (i.e., what they say) with objective biomarkers (i.e., how they say it and what they do) can greatly enhance the quality of telemedicine and improve the efficacy of siteless trials and digital therapeutic interventions.

#22 Stuttering Detection Application

Authors: Kowshik Siva Sai Motepalli ; Vamshiraghusimha Narasinga ; Harsha Pathuri ; Hina Khan ; Sangeetha Mahesh ; Ajish K. Abraham ; Anil Kumar Vuppala

Stuttering is a prevalent speech disorder that affects millions of people worldwide. In this Show and Tell presentation, we demonstrate a novel platform that takes speech samples in English and Kannada to detect and analyze stuttering in patients. The user-friendly interface captures demographic details and speech samples, generating comprehensive reports for different stuttering disfluencies. The platform has four user types, including full read-only access for admins and full write access for super admins. It provides valuable assistance to speech-language pathologists evaluating speech samples. The platform supports both live and recorded speech samples, offering a flexible approach to stuttering detection and analysis. Our research demonstrates the potential of technology to improve speech-language pathology for stuttering. The F-score is used as the metric for evaluating the stutter detection models.

#23 Providing Interpretable Insights for Neurological Speech and Cognitive Disorders from Interactive Serious Games

Authors: Mario Zusag ; Laurin Wagner

We propose an automated pipeline for robustly identifying neurological disorders from interactive therapeutic exercises gathered via the mobile therapy app myReha. The app captures speech and cognitive parameters from over 30,000 tasks in various scenarios. Users get immediate and highly accurate feedback on pronunciation and coherence for language tasks, while voice recordings are fed to a feature extraction pipeline in the backend. These features are then used to construct speech characteristics that are highly indicative of different neurological disorders, such as acquired aphasia after stroke. The data are presented visually in a web application, nyra.insights, which allows medical professionals to quickly derive treatment recommendations and closely monitor outcomes. During the Show and Tell session, users can experiment with the interactive myReha app and experience its real-time speech analysis capabilities via the nyra.insights web platform.

#24 Automated Neural Nursing Assistant (ANNA): An Over-The-Phone System for Cognitive Monitoring

Authors: Jacob Solinsky ; Raymond Finzel ; Martin Michalowski ; Serguei Pakhomov

ANNA is a telephony-based cognitive assessment tool designed to aid nurses in caring for patients who require close monitoring for the development of confusion or neurological impairment. Of particular concern is the treatment of Immune Effector Cell-Associated Neurotoxicity Syndrome (ICANS), a condition that occurs quite frequently as an adverse outcome of Chimeric Antigen Receptor T-cell (CAR-T) cancer immunotherapy. ANNA employs both traditional verbal tests for cognitive impairment and novel linguistic methods that identify abnormalities in the patient's speech during ordinary conversation. To collect ordinary speech, it uses a lightweight instance of Facebook's large language model BlenderBot to engage the patient in a partially unscripted conversation. ANNA is designed for easy deployment by healthcare providers, being sufficiently lightweight to run on consumer-grade hardware and needing only a patient's phone number to interact with them.

#25 5G-IoT Cloud based Demonstration of Real-Time Audio-Visual Speech Enhancement for Multimodal Hearing-aids

Authors: Ankit Gupta ; Abhijeet Bishnu ; Mandar Gogate ; Kia Dashtipour ; Tughrul Arslan ; Ahsan Adeel ; Amir Hussain ; Tharmalingam Ratnarajah ; Mathini Sellathurai

Over twenty percent of the world's population suffers from some form of hearing loss, making it one of the most significant public health challenges. Current hearing aids commonly amplify noise while failing to improve speech comprehension in crowded social settings. In this demonstration, we showcase a proof-of-concept implementation of the world's first 5G and Internet of Things (IoT) enabled multi-modal hearing aid (MM HA) prototype. It integrates an innovative 5G cloud radio access network (C-RAN) and IoT-based transceiver model for real-time audio-visual speech enhancement (AVSE). Specifically, we demonstrate a transceiver model for cloud-based AVSE that satisfies the high data rate and low latency requirements of future MM HAs. The 5G-IoT transceiver application is shown to meet HA latency limits while transmitting raw noisy AV data from an MM HA prototype device to the cloud for deep learning-based real-time AVSE processing and returning a clean audio signal.