Audio and Speech Processing

Date: Thu, 9 May 2024 | Total: 7

#1 SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Authors: You Zhang ; Yongyi Zang ; Jiatong Shi ; Ryuichi Yamamoto ; Jionghao Han ; Yuxun Tang ; Tomoki Toda ; Zhiyao Duan

The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specialized field requiring focused attention. To promote SVDD research, we recently proposed the "SVDD Challenge," the very first research challenge focusing on SVDD for lab-controlled and in-the-wild bonafide and deepfake singing voice recordings. The challenge will be held in conjunction with the 2024 IEEE Spoken Language Technology Workshop (SLT 2024).

#2 HILCodec: High Fidelity and Lightweight Neural Audio Codec

Authors: Sunghwan Ahn ; Beom Jun Woo ; Min Hyun Han ; Chanyeong Moon ; Nam Soo Kim

The recent advancement of end-to-end neural audio codecs enables compressing audio at very low bitrates while reconstructing the output audio with high fidelity. Nonetheless, such improvements often come at the cost of increased model complexity. In this paper, we identify and address the problems of existing neural audio codecs. We show that the performance of Wave-U-Net does not increase consistently as the network depth increases. We analyze the root cause of such a phenomenon and suggest a variance-constrained design. Also, we reveal various distortions in previous waveform domain discriminators and propose a novel distortion-free discriminator. The resulting model, HILCodec, is a real-time streaming audio codec that demonstrates state-of-the-art quality across various bitrates and audio types.

#3 SingIt! Singer Voice Transformation

Authors: Amit Eliav ; Aaron Taub ; Renana Opochinsky ; Sharon Gannot

In this paper, we propose a model which can generate a singing voice from a normal speech utterance by harnessing zero-shot, many-to-many style transfer learning. Our goal is to give anyone the opportunity to sing any song in a timely manner. We present a system comprising several available blocks, as well as a modified auto-encoder, and show how this highly complex challenge can be achieved by tailoring rather simple solutions together. We demonstrate the applicability of the proposed system using a group of 25 non-expert listeners. Samples of the data generated from our model are provided.

#4 An LSTM-Based Chord Generation System Using Chroma Histogram Representations

Author: Jack Hardwick

This paper proposes a system for chord generation to monophonic symbolic melodies using an LSTM-based model trained on chroma histogram representations of chords. Chroma representations promise more harmonically rich generation than chord label-based approaches, whilst maintaining a small number of dimensions in the dataset. This system is shown to be suitable for limited real-time use. While it does not meet the state-of-the-art for coherent long-term generation, it does show diatonic generation with cadential chord relationships. The need for further study into chroma histograms as an extracted feature in chord generation tasks is highlighted.
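
To make the idea of a chroma representation concrete, the sketch below folds the MIDI notes of a chord into a 12-bin pitch-class histogram. It is only an illustration: the abstract does not specify how the paper builds or normalizes its histograms, so the function name `chroma_histogram` and the sum-to-one normalization are assumptions.

```python
from typing import Iterable, List


def chroma_histogram(midi_notes: Iterable[int]) -> List[float]:
    """Return a 12-bin pitch-class histogram normalized to sum to 1.

    Assumption: octave information is discarded (note % 12) and each
    sounding note contributes equally. The paper may weight notes differently.
    """
    counts = [0.0] * 12
    for note in midi_notes:
        counts[note % 12] += 1.0
    total = sum(counts)
    if total == 0:
        return counts  # empty chord: all-zero vector
    return [c / total for c in counts]


# C major triad (C4, E4, G4): non-zero bins at pitch classes 0, 4 and 7.
c_major = chroma_histogram([60, 64, 67])

# A wider voicing with a doubled root (C3, E4, G4, C5): bin 0 gets more weight.
c_major_wide = chroma_histogram([48, 64, 67, 72])
```

Unlike a chord label, the vector keeps the same 12 dimensions regardless of how many notes the chord contains, which is one way to read the abstract's point about a small number of dimensions.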

#5 Exploring Speech Pattern Disorders in Autism using Machine Learning

Authors: Chuanbo Hu ; Jacob Thrasher ; Wenqi Li ; Mindi Ruan ; Xiangxu Yu ; Lynn K Paul ; Shuo Wang ; Xin Li

Diagnosing autism spectrum disorder (ASD) by identifying abnormal speech patterns from examiner-patient dialogues presents significant challenges due to the subtle and diverse manifestations of speech-related symptoms in affected individuals. This study presents a comprehensive approach to identify distinctive speech patterns through the analysis of examiner-patient dialogues. Utilizing a dataset of recorded dialogues, we extracted 40 speech-related features, categorized into frequency, zero-crossing rate, energy, spectral characteristics, Mel Frequency Cepstral Coefficients (MFCCs), and balance. These features encompass various aspects of speech such as intonation, volume, rhythm, and speech rate, reflecting the complex nature of communicative behaviors in ASD. We employed machine learning for both classification and regression tasks to analyze these speech features. The classification model aimed to differentiate between ASD and non-ASD cases, achieving an accuracy of 87.75%. Regression models were developed to predict speech pattern related variables and a composite score from all variables, facilitating a deeper understanding of the speech dynamics associated with ASD. The effectiveness of machine learning in interpreting intricate speech patterns and the high classification accuracy underscore the potential of computational methods in supporting the diagnostic processes for ASD. This approach not only aids in early detection but also contributes to personalized treatment planning by providing insights into the speech and communication profiles of individuals with ASD.
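
Two of the feature categories named above, zero-crossing rate and energy, can be illustrated with a few lines of plain Python. This is a minimal sketch, not the study's feature extractor: the frame size, hop size, and helper names (`split_frames`, `zero_crossing_rate`, `short_time_energy`) are assumptions chosen for the example.

```python
from typing import List, Sequence


def split_frames(signal: Sequence[float], size: int, hop: int) -> List[Sequence[float]]:
    """Cut a signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def short_time_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of the frame."""
    if not frame:
        return 0.0
    return sum(x * x for x in frame) / len(frame)


# A rapidly alternating frame crosses zero between every pair of samples.
noisy = [1.0, -1.0, 1.0, -1.0]
# A constant frame never crosses zero.
steady = [0.5, 0.5, 0.5, 0.5]

features = [(zero_crossing_rate(f), short_time_energy(f)) for f in (noisy, steady)]
```

Per-frame values like these are typically summarized (for example by mean and variance) before being passed to a classifier; how the study aggregates them is not stated in the abstract.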

#6 The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio

Authors: Yuankun Xie ; Yi Lu ; Ruibo Fu ; Zhengqi Wen ; Zhiyong Wang ; Jianhua Tao ; Xin Qi ; Xiaopeng Wang ; Yukun Liu ; Haonan Cheng ; Long Ye ; Yi Sun

With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALMs directly utilize neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method, the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source large-scale dataset, including two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minimum. Experimental results demonstrate that co-training on the Codecfake dataset and a vocoded dataset with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.
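
For readers unfamiliar with the metric, the Equal Error Rate is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The sketch below computes it by sweeping a threshold over the observed scores. It is not the authors' evaluation script: the function name `equal_error_rate` and the convention that a higher score means "more likely bonafide" are assumptions.

```python
from typing import Sequence


def equal_error_rate(bonafide: Sequence[float], spoof: Sequence[float]) -> float:
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumption: a sample is accepted as bonafide when score >= threshold.
    Returns the average of FAR and FRR at the threshold where they are closest.
    """
    if not bonafide or not spoof:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(bonafide) | set(spoof))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        far = sum(s >= t for s in spoof) / len(spoof)        # spoof accepted
        frr = sum(s < t for s in bonafide) / len(bonafide)   # bonafide rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Perfectly separated scores give an EER of 0.
separated = equal_error_rate([0.9, 0.8], [0.1, 0.2])

# One spoof scores above most bonafide samples and one bonafide scores low:
# at threshold 0.6 one in four of each class is misclassified.
overlapping = equal_error_rate([0.9, 0.7, 0.6, 0.3], [0.8, 0.4, 0.2, 0.1])
```

The quadratic sweep keeps the example short; production toolkits usually sort the scores once and interpolate between thresholds.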

#7 Dichotic harmony for the musical practice

Author: Vadim R. Madgazin

The dichotic method of listening is adapted to the domain of musical harmony. An algorithm for separating dissonant voices into several separate groups is proposed. To increase the pleasantness of chords, the different groups of voices are heard through different headphone channels. Two demonstration programs for PC have been created. Keywords: music, harmony, chord, dichotic listening, dissonance, consonance, headphones, pleasantness, midi.
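
The abstract does not describe the separation algorithm itself, so the following is a purely hypothetical illustration of the general idea: assign each voice of a chord to a headphone channel so that voices forming a dissonant interval land in different channels where possible. The set of dissonant interval classes, the greedy strategy, and the names `interval_class` and `split_voices` are all assumptions made for this sketch, not the author's method.

```python
from typing import List, Sequence

# Assumption: seconds/sevenths (classes 1 and 2) and the tritone (class 6)
# are treated as dissonant. The paper may use a different criterion.
DISSONANT_INTERVAL_CLASSES = {1, 2, 6}


def interval_class(a: int, b: int) -> int:
    """Smallest distance in semitones between two pitch classes (0..6)."""
    d = abs(a - b) % 12
    return min(d, 12 - d)


def split_voices(midi_notes: Sequence[int], num_groups: int = 2) -> List[List[int]]:
    """Greedily place each note in the group where it causes the fewest clashes.

    Notes are processed from lowest to highest; ties go to the first group.
    """
    groups: List[List[int]] = [[] for _ in range(num_groups)]
    for note in sorted(midi_notes):
        clashes = [
            sum(interval_class(note, other) in DISSONANT_INTERVAL_CLASSES for other in g)
            for g in groups
        ]
        groups[clashes.index(min(clashes))].append(note)
    return groups


# C major seventh (C4, E4, G4, B4): B4 forms a major seventh with C4,
# so it is moved to the second channel.
left, right = split_voices([60, 64, 67, 71])
```

With two groups, the first list could be routed to the left channel and the second to the right; a greedy pass like this does not guarantee an optimal split for dense clusters.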