Audio and Speech Processing | Cool Papers - Immersive Paper Discovery

#1 Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes

Authors: Ivan Kukanov ; Janne Laakkonen ; Tomi Kinnunen ; Ville Hautamäki

Current speech deepfake detection approaches perform satisfactorily against known adversaries; however, generalization to unseen attacks remains an open challenge. The proliferation of speech deepfakes on social media underscores the need for systems that can generalize to attacks not observed during training. We address this problem from the perspective of meta-learning, aiming to learn attack-invariant features so that the system can adapt to unseen attacks with very few samples available. This approach is promising since generating a large-scale training dataset is often expensive or infeasible. Our experiments demonstrated an improvement in the Equal Error Rate (EER) from 21.67% to 10.42% on the InTheWild dataset, using just 96 samples from the unseen dataset. Continuous few-shot adaptation ensures that the system remains up-to-date.

Subjects: Audio and Speech Processing ; Artificial Intelligence ; Sound

Publish: 2024-10-27 20:14:32 UTC
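
The abstract above reports results as an Equal Error Rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be approximated from detector scores by sweeping a threshold. The function name and the toy scores are illustrative assumptions and are not taken from the paper; it also assumes that a higher score means "more likely bona fide".

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: higher score = more likely bona fide, and a trial is
    accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance rate: spoofed trials accepted as bona fide.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide trials rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical, for illustration only).
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(bonafide, spoof)
```

With real data, an interpolated ROC-based estimate is usually preferable to this coarse sweep, since the two error rates rarely cross exactly at an observed score.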

#2 Analyzing long-term rhythm variations in Mising and Assamese using frequency domain correlates

Authors: Parismita Gogoi ; Priyankoo Sarmah ; S. R. M. Prasanna

The current work explores long-term speech rhythm variations to classify Mising and Assamese, two low-resourced languages from Assam, Northeast India. We study the temporal information of speech rhythm embedded in low-frequency (LF) spectrograms derived from amplitude modulation (AM) and frequency modulation (FM) envelopes. This quantitative frequency domain analysis of rhythm is supported by the idea of rhythm formant analysis (RFA), originally proposed by Gibbon [1]. We conduct the investigation by extracting features derived from the trajectories of the first six rhythm formants, along with two-dimensional discrete cosine transform-based characterizations of the AM and FM LF spectrograms. The derived features are fed as input to a machine learning tool to contrast the rhythms of Assamese and Mising. In this way, an improved methodology for empirically investigating rhythm variation structure without prior annotation of the larger unit of the speech signal is illustrated for two low-resourced languages of Northeast India.

Subject: Audio and Speech Processing

Publish: 2024-10-26 06:22:32 UTC

#3 GPT-4o System Card

Authors: OpenAI : Aaron Hurst ; Adam Lerer ; Adam P. Goucher ; Adam Perelman ; Aditya Ramesh ; Aidan Clark ; AJ Ostrow ; Akila Welihinda ; Alan Hayes ; Alec Radford ; Aleksander Mądry ; Alex Baker-Whitcomb ; Alex Beutel ; Alex Borzunov ; Alex Carney ; Alex Chow ; Alex Kirillov ; Alex Nichol ; Alex Paino ; Alex Renzin ; Alex Tachard Passos ; Alexander Kirillov ; Alexi Christakis ; Alexis Conneau ; Ali Kamali ; Allan Jabri ; Allison Moyer ; Allison Tam ; Amadou Crookes ; Amin Tootoochian ; Amin Tootoonchian ; Ananya Kumar ; Andrea Vallone ; Andrej Karpathy ; Andrew Braunstein ; Andrew Cann ; Andrew Codispoti ; Andrew Galu ; Andrew Kondrich ; Andrew Tulloch ; Andrey Mishchenko ; Angela Baek ; Angela Jiang ; Antoine Pelisse ; Antonia Woodford ; Anuj Gosalia ; Arka Dhar ; Ashley Pantuliano ; Avi Nayak ; Avital Oliver ; Barret Zoph ; Behrooz Ghorbani ; Ben Leimberger ; Ben Rossen ; Ben Sokolowsky ; Ben Wang ; Benjamin Zweig ; Beth Hoover ; Blake Samic ; Bob McGrew ; Bobby Spero ; Bogo Giertler ; Bowen Cheng ; Brad Lightcap ; Brandon Walkin ; Brendan Quinn ; Brian Guarraci ; Brian Hsu ; Bright Kellogg ; Brydon Eastman ; Camillo Lugaresi ; Carroll Wainwright ; Cary Bassin ; Cary Hudson ; Casey Chu ; Chad Nelson ; Chak Li ; Chan Jun Shern ; Channing Conger ; Charlotte Barette ; Chelsea Voss ; Chen Ding ; Cheng Lu ; Chong Zhang ; Chris Beaumont ; Chris Hallacy ; Chris Koch ; Christian Gibson ; Christina Kim ; Christine Choi ; Christine McLeavey ; Christopher Hesse ; Claudia Fischer ; Clemens Winter ; Coley Czarnecki ; Colin Jarvis ; Colin Wei ; Constantin Koumouzelis ; Dane Sherburn ; Daniel Kappler ; Daniel Levin ; Daniel Levy ; David Carr ; David Farhi ; David Mely ; David Robinson ; David Sasaki ; Denny Jin ; Dev Valladares ; Dimitris Tsipras ; Doug Li ; Duc Phong Nguyen ; Duncan Findlay ; Edede Oiwoh ; Edmund Wong ; Ehsan Asdar ; Elizabeth Proehl ; Elizabeth Yang ; Eric Antonow ; Eric Kramer ; Eric Peterson ; Eric Sigler ; Eric Wallace ; Eugene Brevdo ; Evan Mays ; 
Farzad Khorasani ; Felipe Petroski Such ; Filippo Raso ; Francis Zhang ; Fred von Lohmann ; Freddie Sulit ; Gabriel Goh ; Gene Oden ; Geoff Salmon ; Giulio Starace ; Greg Brockman ; Hadi Salman ; Haiming Bao ; Haitang Hu ; Hannah Wong ; Haoyu Wang ; Heather Schmidt ; Heather Whitney ; Heewoo Jun ; Hendrik Kirchner ; Henrique Ponde de Oliveira Pinto ; Hongyu Ren ; Huiwen Chang ; Hyung Won Chung ; Ian Kivlichan ; Ian O'Connell ; Ian O'Connell ; Ian Osband ; Ian Silber ; Ian Sohl ; Ibrahim Okuyucu ; Ikai Lan ; Ilya Kostrikov ; Ilya Sutskever ; Ingmar Kanitscheider ; Ishaan Gulrajani ; Jacob Coxon ; Jacob Menick ; Jakub Pachocki ; James Aung ; James Betker ; James Crooks ; James Lennon ; Jamie Kiros ; Jan Leike ; Jane Park ; Jason Kwon ; Jason Phang ; Jason Teplitz ; Jason Wei ; Jason Wolfe ; Jay Chen ; Jeff Harris ; Jenia Varavva ; Jessica Gan Lee ; Jessica Shieh ; Ji Lin ; Jiahui Yu ; Jiayi Weng ; Jie Tang ; Jieqi Yu ; Joanne Jang ; Joaquin Quinonero Candela ; Joe Beutler ; Joe Landers ; Joel Parish ; Johannes Heidecke ; John Schulman ; Jonathan Lachman ; Jonathan McKay ; Jonathan Uesato ; Jonathan Ward ; Jong Wook Kim ; Joost Huizinga ; Jordan Sitkin ; Jos Kraaijeveld ; Josh Gross ; Josh Kaplan ; Josh Snyder ; Joshua Achiam ; Joy Jiao ; Joyce Lee ; Juntang Zhuang ; Justyn Harriman ; Kai Fricke ; Kai Hayashi ; Karan Singhal ; Katy Shi ; Kavin Karthik ; Kayla Wood ; Kendra Rimbach ; Kenny Hsu ; Kenny Nguyen ; Keren Gu-Lemberg ; Kevin Button ; Kevin Liu ; Kiel Howe ; Krithika Muthukumar ; Kyle Luther ; Lama Ahmad ; Larry Kai ; Lauren Itow ; Lauren Workman ; Leher Pathak ; Leo Chen ; Li Jing ; Lia Guy ; Liam Fedus ; Liang Zhou ; Lien Mamitsuka ; Lilian Weng ; Lindsay McCallum ; Lindsey Held ; Long Ouyang ; Louis Feuvrier ; Lu Zhang ; Lukas Kondraciuk ; Lukasz Kaiser ; Luke Hewitt ; Luke Metz ; Lyric Doshi ; Mada Aflak ; Maddie Simens ; Madelaine Boyd ; Madeleine Thompson ; Marat Dukhan ; Mark Chen ; Mark Gray ; Mark Hudnall ; Marvin Zhang ; Marwan Aljubeh ;
Mateusz Litwin ; Matthew Zeng ; Max Johnson ; Maya Shetty ; Mayank Gupta ; Meghan Shah ; Mehmet Yatbaz ; Meng Jia Yang ; Mengchao Zhong ; Mia Glaese ; Mianna Chen ; Michael Janner ; Michael Lampe ; Michael Petrov ; Michael Wu ; Michele Wang ; Michelle Fradin ; Michelle Pokrass ; Miguel Castro ; Miguel Oom Temudo de Castro ; Mikhail Pavlov ; Miles Brundage ; Miles Wang ; Minal Khan ; Mira Murati ; Mo Bavarian ; Molly Lin ; Murat Yesildal ; Nacho Soto ; Natalia Gimelshein ; Natalie Cone ; Natalie Staudacher ; Natalie Summers ; Natan LaFontaine ; Neil Chowdhury ; Nick Ryder ; Nick Stathas ; Nick Turley ; Nik Tezak ; Niko Felix ; Nithanth Kudige ; Nitish Keskar ; Noah Deutsch ; Noel Bundick ; Nora Puckett ; Ofir Nachum ; Ola Okelola ; Oleg Boiko ; Oleg Murk ; Oliver Jaffe ; Olivia Watkins ; Olivier Godement ; Owen Campbell-Moore ; Patrick Chao ; Paul McMillan ; Pavel Belov ; Peng Su ; Peter Bak ; Peter Bakkum ; Peter Deng ; Peter Dolan ; Peter Hoeschele ; Peter Welinder ; Phil Tillet ; Philip Pronin ; Philippe Tillet ; Prafulla Dhariwal ; Qiming Yuan ; Rachel Dias ; Rachel Lim ; Rahul Arora ; Rajan Troll ; Randall Lin ; Rapha Gontijo Lopes ; Raul Puri ; Reah Miyara ; Reimar Leike ; Renaud Gaubert ; Reza Zamani ; Ricky Wang ; Rob Donnelly ; Rob Honsby ; Rocky Smith ; Rohan Sahai ; Rohit Ramchandani ; Romain Huet ; Rory Carmichael ; Rowan Zellers ; Roy Chen ; Ruby Chen ; Ruslan Nigmatullin ; Ryan Cheu ; Saachi Jain ; Sam Altman ; Sam Schoenholz ; Sam Toizer ; Samuel Miserendino ; Sandhini Agarwal ; Sara Culver ; Scott Ethersmith ; Scott Gray ; Sean Grove ; Sean Metzger ; Shamez Hermani ; Shantanu Jain ; Shengjia Zhao ; Sherwin Wu ; Shino Jomoto ; Shirong Wu ; Shuaiqi ; Xia ; Sonia Phene ; Spencer Papay ; Srinivas Narayanan ; Steve Coffey ; Steve Lee ; Stewart Hall ; Suchir Balaji ; Tal Broda ; Tal Stramer ; Tao Xu ; Tarun Gogineni ; Taya Christianson ; Ted Sanders ; Tejal Patwardhan ; Thomas Cunninghman ; Thomas Degry ; Thomas Dimson ; Thomas Raoux ; Thomas Shadwell ;
Tianhao Zheng ; Todd Underwood ; Todor Markov ; Toki Sherbakov ; Tom Rubin ; Tom Stasi ; Tomer Kaftan ; Tristan Heywood ; Troy Peterson ; Tyce Walters ; Tyna Eloundou ; Valerie Qi ; Veit Moeller ; Vinnie Monaco ; Vishal Kuo ; Vlad Fomenko ; Wayne Chang ; Weiyi Zheng ; Wenda Zhou ; Wesam Manassra ; Will Sheu ; Wojciech Zaremba ; Yash Patil ; Yilei Qian ; Yongjik Kim ; Youlong Cheng ; Yu Zhang ; Yuchen He ; Yuchen Zhang ; Yujia Jin ; Yunxing Dai ; Yury Malkov

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

Subjects: Computation and Language ; Artificial Intelligence ; Computer Vision and Pattern Recognition ; Computers and Society ; Machine Learning ; Sound ; Audio and Speech Processing

Publish: 2024-10-25 17:43:01 UTC

#4 OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

Authors: Xize Cheng ; Siqi Zheng ; Zehan Wang ; Minghui Fang ; Ziang Zhang ; Rongjie Huang ; Ziyang Ma ; Shengpeng Ji ; Jialong Zuo ; Tao Jin ; Zhou Zhao

Scaling up has brought tremendous success in the fields of vision and language in recent years. When it comes to audio, however, researchers encounter a major challenge in scaling up the training data, as most natural audio contains diverse interfering signals. To address this limitation, we introduce Omni-modal Sound Separation (OmniSep), a novel framework capable of isolating clean soundtracks based on omni-modal queries, encompassing both single-modal and multi-modal composed queries. Specifically, we introduce the Query-Mixup strategy, which blends query features from different modalities during training. This enables OmniSep to optimize multiple modalities concurrently, effectively bringing all modalities under a unified framework for sound separation. We further enhance this flexibility by allowing queries to influence sound separation positively or negatively, facilitating the retention or removal of specific sounds as desired. Finally, OmniSep employs a retrieval-augmented approach known as Query-Aug, which enables open-vocabulary sound separation. Experimental evaluations on the MUSIC, VGGSOUND-CLEAN+, and MUSIC-CLEAN+ datasets demonstrate the effectiveness of OmniSep, achieving state-of-the-art performance in text-, image-, and audio-queried sound separation tasks. For samples and further information, please visit the demo page at https://omnisep.github.io/.

Subjects: Sound ; Computer Vision and Pattern Recognition ; Multimedia ; Audio and Speech Processing

Publish: 2024-10-28 17:58:15 UTC

#5 ST-ITO: Controlling Audio Effects for Style Transfer with Inference-Time Optimization

Authors: Christian J. Steinmetz ; Shubhr Singh ; Marco Comunità ; Ilias Ibnyahya ; Shanxin Yuan ; Emmanouil Benetos ; Joshua D. Reiss

Audio production style transfer is the task of processing an input to impart stylistic elements from a reference recording. Existing approaches often train a neural network to estimate control parameters for a set of audio effects. However, these approaches are limited in that they can only control a fixed set of effects, where the effects must be differentiable or otherwise employ specialized training techniques. In this work, we introduce ST-ITO, Style Transfer with Inference-Time Optimization, an approach that instead searches the parameter space of an audio effect chain at inference. This method enables control of arbitrary audio effect chains, including unseen and non-differentiable effects. Our approach employs a learned metric of audio production style, which we train through a simple and scalable self-supervised pretraining strategy, along with a gradient-free optimizer. Due to the limited existing evaluation methods for audio production style transfer, we introduce a multi-part benchmark to evaluate audio production style metrics and style transfer systems. This evaluation demonstrates that our audio representation better captures attributes related to audio production and enables expressive style transfer via control of arbitrary audio effects.

Subjects: Sound ; Audio and Speech Processing

Publish: 2024-10-28 17:24:37 UTC

#6 SepMamba: State-space models for speaker separation using Mamba

Authors: Thor Højhus Avenstrup ; Boldizsár Elek ; István László Mádi ; András Bence Schin ; Morten Mørup ; Bjørn Sand Jensen ; Kenny Falkær Olsen

Deep learning-based single-channel speaker separation has improved significantly in recent years largely due to the introduction of the transformer-based attention mechanism. However, these improvements come at the expense of intense computational demands, precluding their use in many practical applications. As a computationally efficient alternative with similar modeling capabilities, Mamba was recently introduced. We propose SepMamba, a U-Net-based architecture composed primarily of bidirectional Mamba layers. We find that our approach outperforms similarly-sized prominent models - including transformer-based models - on the WSJ0 2-speaker dataset while enjoying a significant reduction in computational cost, memory usage, and forward pass time. We additionally report strong results for causal variants of SepMamba. Our approach provides a computationally favorable alternative to transformer-based architectures for deep speech separation.

Subjects: Sound ; Machine Learning ; Audio and Speech Processing

Publish: 2024-10-28 13:20:53 UTC

#7 Atrial Fibrillation Detection System via Acoustic Sensing for Mobile Phones

Authors: Xuanyu Liu ; Jiao Li ; Haoxian Liu ; Zongqi Yang ; Yi Huang ; Jin Zhang

Atrial fibrillation (AF) is characterized by irregular electrical impulses originating in the atria, which can lead to severe complications and even death. Due to the intermittent nature of AF, early and timely monitoring is critical for patients to prevent further exacerbation of the condition. Although ambulatory ECG Holter monitors provide accurate monitoring, the high cost of these devices hinders their wider adoption. Current mobile-based AF detection systems offer a portable solution; however, these systems have various applicability issues, such as being easily affected by environmental factors and requiring significant user effort. To overcome these limitations, we present MobileAF, a novel smartphone-based AF detection system using speakers and microphones. In order to capture minute cardiac activities, we propose a multi-channel pulse wave probing method. In addition, we enhance the signal quality by introducing a three-stage pulse wave purification pipeline. Furthermore, a ResNet-based network model is built to implement accurate and reliable AF detection. We collect data from 23 participants utilizing our data collection application on the smartphone. Extensive experimental results demonstrate the superior performance of our system, with 97.9% accuracy, 96.8% precision, 97.2% recall, 98.3% specificity, and 97.0% F1 score.

Subjects: Sound ; Computational Engineering, Finance, and Science ; Audio and Speech Processing ; Quantitative Methods

Publish: 2024-10-28 09:14:14 UTC

#8 Data-Efficient Low-Complexity Acoustic Scene Classification via Distilling and Progressive Pruning

Authors: Bing Han ; Wen Huang ; Zhengyang Chen ; Anbai Jiang ; Pingyi Fan ; Cheng Lu ; Zhiqiang Lv ; Jia Liu ; Wei-Qiang Zhang ; Yanmin Qian

The goal of the acoustic scene classification (ASC) task is to classify recordings into one of the predefined acoustic scene classes. However, in real-world scenarios, ASC systems often encounter challenges such as recording device mismatch, low-complexity constraints, and the limited availability of labeled data. To alleviate these issues, in this paper, a data-efficient and low-complexity ASC system is built with a new model architecture and better training strategies. Specifically, we first design a new low-complexity architecture named Rep-Mobile by integrating multi-convolution branches which can be reparameterized at inference. Compared to other models, it achieves better performance with lower computational complexity. Then we apply the knowledge distillation strategy and provide a comparison of the data efficiency of teacher models with different architectures. Finally, we propose a progressive pruning strategy, which involves pruning the model multiple times in small amounts, resulting in better performance compared to single-step pruning. Experiments are conducted on the TAU dataset. With Rep-Mobile and these training strategies, our proposed ASC system achieves state-of-the-art (SOTA) results, while also winning first place with a significant advantage over others in the DCASE2024 Challenge.

Subjects: Sound ; Audio and Speech Processing

Publish: 2024-10-28 06:31:20 UTC

#9 An Ensemble Approach to Music Source Separation: A Comparative Analysis of Conventional and Hierarchical Stem Separation

Authors: Saarth Vardhan ; Pavani R Acharya ; Samarth S Rao ; Oorjitha Ratna Jasthi ; S Natarajan

Music source separation (MSS) is a task that involves isolating individual sound sources, or stems, from mixed audio signals. This paper presents an ensemble approach to MSS, combining several state-of-the-art architectures to achieve superior separation performance across traditional Vocal, Drum, and Bass (VDB) stems, as well as expanding into second-level hierarchical separation for sub-stems like kick, snare, lead vocals, and background vocals. Our method addresses the limitations of relying on a single model by utilising the complementary strengths of various models, leading to more balanced results across stems. For stem selection, we used the harmonic mean of Signal-to-Noise Ratio (SNR) and Signal-to-Distortion Ratio (SDR), ensuring that extreme values do not skew the results and that both metrics are weighted effectively. In addition to consistently high performance across the VDB stems, we also explored second-level hierarchical separation, revealing important insights into the complexities of MSS and how factors like genre and instrumentation can influence model performance. While the second-level separation results show room for improvement, the ability to isolate sub-stems marks a significant advancement. Our findings pave the way for further research in MSS, particularly in expanding model capabilities beyond VDB and improving niche stem separations such as guitar and piano.

Subjects: Sound ; Machine Learning ; Audio and Speech Processing

Publish: 2024-10-28 06:18:12 UTC
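
The abstract above mentions selecting stems by the harmonic mean of SNR and SDR. The following is a minimal sketch of what such a selection rule could look like. The model names and scores are hypothetical, the per-stem "pick the best model" logic is an assumption about how the ensemble is assembled, and the harmonic mean is only defined here for strictly positive values.

```python
def harmonic_mean(snr_db, sdr_db):
    """Harmonic mean of two strictly positive scores (assumed to be in dB)."""
    if snr_db <= 0 or sdr_db <= 0:
        raise ValueError("harmonic mean is only meaningful for positive values")
    return 2 * snr_db * sdr_db / (snr_db + sdr_db)


def select_model_per_stem(scores):
    """Pick, for each stem, the model with the highest harmonic mean.

    `scores` maps stem -> model name -> (SNR, SDR). Hypothetical layout.
    """
    return {
        stem: max(models, key=lambda name: harmonic_mean(*models[name]))
        for stem, models in scores.items()
    }


# Hypothetical scores: model_a has the higher arithmetic mean on vocals
# (6.0 vs 5.5) but is unbalanced, so the harmonic mean prefers model_b.
scores = {
    "vocals": {"model_a": (10.0, 2.0), "model_b": (6.0, 5.0)},
    "drums": {"model_a": (7.0, 7.0), "model_b": (8.0, 4.0)},
}
selection = select_model_per_stem(scores)
```

The harmonic mean is dominated by the smaller of the two values, which matches the abstract's stated goal that extreme values should not skew the choice.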

#10 Mitigating Unauthorized Speech Synthesis for Voice Protection

Authors: Zhisheng Zhang ; Qianyi Yang ; Derui Wang ; Pengyang Huang ; Yuxin Cao ; Kai Ye ; Jie Hao

In recent years, it has become possible to perfectly replicate a speaker's voice with just a few speech samples, while malicious voice exploitation (e.g., telecom fraud for illegal financial gain) has brought huge hazards to our daily lives. Therefore, it is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints. Most previous defense methods have focused on spoofing speaker verification systems in timbre similarity, but the synthesized deepfake speech is still of high quality. In response to the rising hazards, we devise an effective, transferable, and robust proactive protection technology named Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noise to original speech samples to prevent them from being effectively learned by text-to-speech (TTS) synthesis models, so that high-quality deepfake speech cannot be generated. We conduct extensive experiments on state-of-the-art (SOTA) TTS models utilizing objective and subjective metrics to comprehensively evaluate our proposed method. The experimental results demonstrate outstanding effectiveness and transferability across various models. Compared to the speech unclarity score of 21.94% from voice synthesizers trained on samples without protection, POP-protected samples significantly increase it to 127.31%. Moreover, our method shows robustness against noise reduction and data augmentation techniques, thereby greatly reducing potential hazards.

Subjects: Sound ; Artificial Intelligence ; Machine Learning ; Audio and Speech Processing

Publish: 2024-10-28 05:16:37 UTC

#11 Using Confidence Scores to Improve Eyes-free Detection of Speech Recognition Errors

Authors: Sadia Nowrin ; Keith Vertanen

Conversational systems rely heavily on speech recognition to interpret and respond to user commands and queries. Nevertheless, recognition errors may occur, which can significantly affect the performance of such systems. While visual feedback can help detect errors, it may not always be practical, especially for people who are blind or low-vision. In this study, we investigate ways to improve error detection by manipulating the audio output of the transcribed text based on the recognizer's confidence level in its result. Our findings show that selectively slowing down the audio when the recognizer exhibited uncertainty led to a relative increase of 12% in participants' error detection ability compared to uniformly slowing down the audio.

Subjects: Human-Computer Interaction ; Sound ; Audio and Speech Processing

Publish: 2024-10-27 19:33:01 UTC

#12 Automatic Estimation of Singing Voice Musical Dynamics

Authors: Jyoti Narang ; Nazif Can Tamer ; Viviana De La Vega ; Xavier Serra

Musical dynamics form a core part of expressive singing voice performances. However, automatic analysis of musical dynamics for singing voice has received limited attention, partly due to the scarcity of suitable datasets and a lack of clear evaluation frameworks. To address this challenge, we propose a methodology for dataset curation. Employing the proposed methodology, we compile a dataset comprising 509 singing voice performances annotated with musical dynamics, aligned with 163 score files, leveraging state-of-the-art source separation and alignment techniques. The scores are sourced from the OpenScore Lieder corpus of romantic-era compositions, widely known for its wealth of expressive annotations. Utilizing the curated dataset, we train a multi-head attention based CNN model with varying window sizes to evaluate the effectiveness of estimating musical dynamics. We explored two distinct perceptually motivated input representations for the model training: log-Mel spectrum and Bark-scale based features. For testing, we manually curate another dataset of 25 performances annotated with musical dynamics in collaboration with a professional vocalist. We conclude through our experiments that Bark-scale based features outperform log-Mel features for the task of singing voice dynamics prediction. The dataset along with the code is shared publicly for further research on the topic.

Subjects: Sound ; Information Retrieval ; Audio and Speech Processing

Publish: 2024-10-27 18:15:18 UTC

#13 MidiTok Visualizer: a tool for visualization and analysis of tokenized MIDI symbolic music

Authors: Michał Wiszenko ; Kacper Stefański ; Piotr Malesa ; Łukasz Pokorzyński ; Mateusz Modrzejewski

Symbolic music research plays a crucial role in music-related machine learning, but MIDI data can be complex for those without musical expertise. To address this issue, we present MidiTok Visualizer, a web application designed to facilitate the exploration and visualization of various MIDI tokenization methods from the MidiTok Python package. MidiTok Visualizer offers numerous customizable parameters, enabling users to upload MIDI files to visualize tokenized data alongside an interactive piano roll.

Subjects: Sound ; Artificial Intelligence ; Multimedia ; Audio and Speech Processing

Publish: 2024-10-27 17:00:55 UTC

#14 Symbotunes: unified hub for symbolic music generative models

Authors: Paweł Skierś ; Maksymilian Łazarski ; Michał Kopeć ; Mateusz Modrzejewski

Implementations of popular symbolic music generative models often differ significantly in terms of the libraries utilized and overall project structure. Therefore, directly comparing the methods or becoming acquainted with them may present challenges. To mitigate this issue we introduce Symbotunes, an open-source unified hub for symbolic music generative models. Symbotunes contains modern Python implementations of well-known methods for symbolic music generation, as well as a unified pipeline for generating and training.

Subjects: Sound ; Artificial Intelligence ; Machine Learning ; Audio and Speech Processing

Publish: 2024-10-27 16:54:58 UTC

#15 MusicFlow: Cascaded Flow Matching for Text Guided Music Generation

Authors: K R Prajwal ; Bowen Shi ; Matthew Lee ; Apoorv Vyas ; Andros Tjandra ; Mahi Luthra ; Baishan Guo ; Huiyu Wang ; Triantafyllos Afouras ; David Kant ; Wei-Ning Hsu

We introduce MusicFlow, a cascaded text-to-music generation model based on flow matching. Using self-supervised representations to bridge text descriptions and music audio, we construct two flow matching networks to model the conditional distribution of semantic and acoustic features. Additionally, we leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation in a zero-shot manner. Experiments on MusicCaps reveal that the music generated by MusicFlow exhibits superior quality and text coherence despite being over 2 to 5 times smaller and requiring 5 times fewer iterative steps. Simultaneously, the model can perform other music generation tasks and achieves competitive performance in music infilling and continuation. Our code and model will be publicly available.

Subjects: Sound ; Artificial Intelligence ; Audio and Speech Processing

Publish: 2024-10-27 15:35:41 UTC

#16 Conditional GAN for Enhancing Diffusion Models in Efficient and Authentic Global Gesture Generation from Audios

Authors: Yongkang Cheng ; Mingjiang Liang ; Shaoli Huang ; Gaoge Han ; Jifeng Ning ; Wei Liu

Audio-driven simultaneous gesture generation is vital for human-computer communication, AI games, and film production. While previous research has shown promise, there are still limitations. Methods based on VAEs are accompanied by issues of local jitter and global instability, whereas methods based on diffusion models are hampered by low generation efficiency. This is because the denoising process of DDPM in the latter relies on the assumption that the noise added at each step is sampled from a unimodal distribution and that the noise values are small. DDIM borrows the idea from the Euler method for solving differential equations, disrupts the Markov chain process, and increases the noise step size to reduce the number of denoising steps, thereby accelerating generation. However, simply increasing the step size during the step-by-step denoising process causes the results to gradually deviate from the original data distribution, leading to a significant drop in the quality of the generated actions and the emergence of unnatural artifacts. In this paper, we break the assumptions of DDPM and achieve breakthrough progress in denoising speed and fidelity. Specifically, we introduce a conditional GAN to capture audio control signals and implicitly match the multimodal denoising distribution between the diffusion and denoising steps within the same sampling step, aiming to sample larger noise values and apply fewer denoising steps for high-speed generation.

Subjects: Sound ; Artificial Intelligence ; Computer Vision and Pattern Recognition ; Graphics ; Audio and Speech Processing

Publish: 2024-10-27 07:25:11 UTC

#17 An approach to hummed-tune and song sequences matching

Authors: Loc Bao Pham ; Huong Hoang Luong ; Phu Thien Tran ; Phuc Hoang Ngo ; Vi Hoang Nguyen ; Thinh Nguyen

A melody stuck in your head, also known as an "earworm", is tough to get rid of unless you listen to it again or sing it out loud. But what if you cannot find the name of that song? It must be an intolerable feeling. Recognizing a song name based on a hummed tune is not an easy task for a human being and should be done by machines. However, there is no research paper published about hum tune recognition. This work is adapted from the Hum2Song Zalo AI Challenge 2021, a competition about querying the name of a song from a user's hummed tune, which is similar to Google's Hum to Search. This paper covers details about pre-processing the data from the original format (mp3) into a usable form for training and inference. In training an embedding model for the feature extraction phase, we ran experiments with several state-of-the-art architectures, such as ResNet, VGG, AlexNet, and MobileNetV2. For the inference phase, we use the Faiss module to efficiently search for the song that matches the hummed sequence. The result reaches nearly 94% in the MRR@10 metric on the public test set, along with the top-1 result on the public leaderboard.

Subjects: Sound ; Artificial Intelligence ; Information Retrieval ; Audio and Speech Processing

Publish: 2024-10-27 06:50:43 UTC
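
The abstract above reports retrieval quality as MRR@10 (mean reciprocal rank over the top 10 results). The following is a minimal sketch of how that metric is commonly computed. The function name and the toy song identifiers are illustrative assumptions, and it assumes exactly one correct song per hummed query.

```python
def mrr_at_k(ranked_lists, targets, k=10):
    """Mean reciprocal rank of the correct item within the top-k results.

    `ranked_lists[i]` is the ranked list of retrieved song ids for query i,
    and `targets[i]` is the single correct song id (an assumption here).
    A query whose target is not in the top k contributes 0.
    """
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        for rank, item in enumerate(ranked[:k], start=1):
            if item == target:
                total += 1.0 / rank
                break
    return total / len(targets)


# Toy example with hypothetical song ids: the correct song is ranked
# 1st, 2nd, and not retrieved at all, respectively.
retrieved = [["a", "b"], ["x", "y", "z"], ["p", "q"]]
correct = ["a", "y", "r"]
score = mrr_at_k(retrieved, correct)  # (1 + 1/2 + 0) / 3 = 0.5
```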

#18 Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

Authors: Maohao Shen ; Shun Zhang ; Jilong Wu ; Zhiping Xiu ; Ehab AlBadawy ; Yiting Lu ; Mike Seltzer ; Qing He

Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating the catastrophic forgetting issue in either modality. Finally, we further explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its great potential as a multimodal dialog system capable of speech generation.

Subjects: Computation and Language ; Artificial Intelligence ; Sound ; Audio and Speech Processing

Publish: 2024-10-27 04:28:57 UTC

#19 emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography

Authors: Viswanath Sivakumar ; Jeffrey Seely ; Alan Du ; Sean R Bittner ; Adam Berenzweig ; Anuoluwapo Bolarinwa ; Alexandre Gramfort ; Michael I Mandel

Surface electromyography (sEMG) non-invasively measures signals generated by muscle activity with sufficient sensitivity to detect individual spinal neurons and richness to identify dozens of gestures and their nuances. Wearable wrist-based sEMG sensors have the potential to offer low-friction, subtle, information-rich, always-available human-computer inputs. To this end, we introduce emg2qwerty, a large-scale dataset of non-invasive electromyographic signals recorded at the wrists while touch typing on a QWERTY keyboard, together with ground-truth annotations and reproducible baselines. With 1,135 sessions spanning 108 users and 346 hours of recording, this is the largest such public dataset to date. These data demonstrate non-trivial but well-defined hierarchical relationships, both in terms of the generative process, from neurons to muscles and muscle combinations, and in terms of domain shift across users and user sessions. Applying standard modeling techniques from the closely related field of Automatic Speech Recognition (ASR), we show strong baseline performance on predicting key-presses using sEMG signals alone. We believe the richness of this task and dataset will facilitate progress in several problems of interest to both the machine learning and neuroscientific communities. Dataset and code can be accessed at https://github.com/facebookresearch/emg2qwerty.

Subjects: Machine Learning ; Human-Computer Interaction ; Audio and Speech Processing

Publish: 2024-10-26 05:18:48 UTC

#20 Do Discrete Self-Supervised Representations of Speech Capture Tone Distinctions?

Authors: Opeyemi Osakuade ; Simon King

Discrete representations of speech, obtained from Self-Supervised Learning (SSL) foundation models, are widely used, especially where there are limited data for the downstream task, such as for a low-resource language. Typically, discretization of speech into a sequence of symbols is achieved by unsupervised clustering of the latents from an SSL model. Our study evaluates whether discrete symbols - found using k-means - adequately capture tone in two example languages, Mandarin and Yoruba. We compare latent vectors with discrete symbols, obtained from HuBERT base, MandarinHuBERT, or XLS-R, for vowel and tone classification. We find that using discrete symbols leads to a substantial loss of tone information, even for language-specialised SSL models. We suggest that discretization needs to be task-aware, particularly for tone-dependent downstream tasks.
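To make the discretization step concrete, the sketch below maps each latent vector to the index of its nearest centroid, which is how k-means cluster assignments turn a frame sequence into a symbol sequence. It is a toy illustration only: the 2-D "latents", the hand-picked centroids (standing in for centroids that k-means would normally fit), and the function name `quantize` are all assumptions, not the paper's setup.

```python
import math


def quantize(frames, centroids):
    """Map each latent vector to the index of its nearest centroid."""
    symbols = []
    for frame in frames:
        distances = [math.dist(frame, c) for c in centroids]
        symbols.append(distances.index(min(distances)))
    return symbols


# Hypothetical 2-D latents: dimension 0 stands in for vowel identity,
# dimension 1 for a smaller, tone-related offset.
centroids = [(0.0, 0.0), (5.0, 0.0)]
frames = [(0.1, 0.4), (0.1, -0.4), (4.9, 0.4), (4.9, -0.4)]

symbols = quantize(frames, centroids)
print(symbols)
```

In this toy case, frames that differ only along the second dimension receive the same symbol, so the distinction they carried is no longer recoverable from the symbol sequence. Whether real SSL latents behave this way for tone is the empirical question the abstract addresses.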

Subjects: Computation and Language ; Sound ; Audio and Speech Processing

Publish: 2024-10-25 19:13:25 UTC

#21 Single-word Auditory Attention Decoding Using Deep Learning Model

Authors: Nhan Duc Thanh Nguyen ; Huy Phan ; Kaare Mikkelsen ; Preben Kidmose

Identifying auditory attention by comparing auditory stimuli with the corresponding brain responses is known as auditory attention decoding (AAD). The majority of AAD algorithms utilize the so-called envelope entrainment mechanism, whereby auditory attention is identified by how the envelope of the auditory stream drives variation in the electroencephalography (EEG) signal. However, neural processing can also be decoded based on endogenous cognitive responses, in this case neural responses evoked by attention to specific words in a speech stream. This approach is largely unexplored in the field of AAD, but it leads to a single-word auditory attention decoding problem in which an epoch of an EEG signal timed to a specific word is labeled as attended or unattended. This paper presents a deep learning approach, based on EEGNet, to address this challenge. We conducted a subject-independent evaluation on an event-based AAD dataset with three different paradigms: word category oddball, word category with competing speakers, and competing speech streams with targets. The results demonstrate that the adapted model is capable of exploiting cognitive-related spatiotemporal EEG features and achieving at least 58% accuracy on the most realistic competing paradigm for the unseen subjects. To our knowledge, this is the first study dealing with this problem.

Subjects: Signal Processing ; Artificial Intelligence ; Human-Computer Interaction ; Sound ; Audio and Speech Processing ; Neurons and Cognition

Publish: 2024-10-15 21:57:19 UTC

#22 A Novel Numerical Method for Relaxing the Minimal Configurations of TOA-Based Joint Sensors and Sources Localization

Authors: Faxian Cao ; Yongqiang Cheng ; Adil Mehmood Khan ; Zhijing Yang ; Yingxiu Chang

This work introduces a novel numerical method that relaxes the minimal configuration requirements for joint sensors and sources localization (JSSL) in 3D space using time of arrival (TOA) measurements. Traditionally, the principle requires that the number of valid equations (TOA measurements) must be equal to or greater than the number of unknown variables (sensor and source locations). State-of-the-art literature suggests that the minimum numbers of sensors and sources needed for localization are four to six and six to four, respectively. However, these stringent configurations limit the application of JSSL in scenarios with an insufficient number of sensors and sources. To overcome this limitation, we propose a numerical method that reduces the required number of sensors and sources, enabling more flexible JSSL configurations. First, we formulate the JSSL task as a series of triangles and apply the law of cosines to determine four unknown distances associated with one pair of sensors and three pairs of sources. Next, by utilizing triangle inequalities, we establish the lower and upper boundaries for these unknowns based on the known TOA measurements. The numerical method then searches within these boundaries to find the global optimal solutions, demonstrating that JSSL in 3D space is achievable with only four sensors and four sources, thus significantly relaxing the minimal configuration requirements. Theoretical proofs and simulation results confirm the feasibility and effectiveness of the proposed method.
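The bounding step mentioned above can be illustrated with the generic triangle inequality: if two points A and B each have a known distance to a shared third point, the unknown distance between A and B lies between the absolute difference and the sum of those two distances, and several shared points tighten the interval. The sketch below shows only this generic idea and is not the paper's formulation. The coordinates, the function name `distance_bounds`, and the assumption that TOA measurements have already been converted to distances are all hypothetical.

```python
import math


def distance_bounds(dists_a, dists_b):
    """Bound the unknown A-B distance from known distances to shared points.

    For every shared point P: |d(A,P) - d(B,P)| <= d(A,B) <= d(A,P) + d(B,P).
    The tightest interval takes the largest lower and smallest upper bound.
    """
    lower = max(abs(da - db) for da, db in zip(dists_a, dists_b))
    upper = min(da + db for da, db in zip(dists_a, dists_b))
    return lower, upper


# Hypothetical 3-D layout: two sensors and four sources.
sensor_a = (0.0, 0.0, 0.0)
sensor_b = (3.0, 0.0, 0.0)
sources = [(1.0, 2.0, 0.0), (4.0, 1.0, 2.0), (-1.0, 0.5, 1.0), (2.0, -3.0, 1.0)]

# Stand-ins for TOA-derived sensor-to-source distances.
dists_a = [math.dist(sensor_a, s) for s in sources]
dists_b = [math.dist(sensor_b, s) for s in sources]

lower, upper = distance_bounds(dists_a, dists_b)
true_distance = math.dist(sensor_a, sensor_b)
print(lower, true_distance, upper)
```

A numerical search such as the one the abstract describes could then be restricted to the interval between `lower` and `upper` rather than an unbounded range.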

Subjects: Signal Processing ; Audio and Speech Processing

Publish: 2024-10-13 20:38:01 UTC