Processing math: 100%

Sound

2025-04-02 | | Total: 5

#1 A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives: Data, Methods, and Challenges [PDF3] [Copy] [Kimi3] [REL]

Authors: Shuyu Li, Shulei Ji, Zihao Wang, Songruoyao Wu, Jiaxing Yu, Kejun Zhang

Multi-modal music generation, using multiple modalities like images, video, and text alongside musical scores and audio as guidance, is an emerging research area with broad applications. This paper reviews this field, categorizing music generation systems from the perspective of modalities. It covers modality representation, multi-modal data alignment, and their utilization to guide music generation. We also discuss current datasets and evaluation methods. Key challenges in this area include effective multi-modal integration, large-scale comprehensive datasets, and systematic evaluation methods. Finally, we provide an outlook on future research directions focusing on multi-modal fusion, alignment, data, and evaluation.

Subjects: Sound , Artificial Intelligence , Multimedia

Publish: 2025-04-01 14:26:25 UTC


#2 C2AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [PDF2] [Copy] [Kimi1] [REL]

Authors: Wenxuan Wu, Xueyuan Chen, Shuai Wang, Jiadong Wang, Lingwei Meng, Xixin Wu, Helen Meng, Haizhou Li

Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by primarily relying on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech through contextual information. This limitation results in not only suboptimal performance but also inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called the Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model to assess extraction quality and guide extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of our proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.

Subjects: Sound , Machine Learning , Multimedia , Audio and Speech Processing

Publish: 2025-04-01 13:01:30 UTC


#3 User authentication on earable devices via bone-conducted occlusion sounds [PDF] [Copy] [Kimi] [REL]

Authors: Yadong Xie, Fan Li, Yue Wu, Yu Wang

With the rapid development of mobile devices and the fast increase of sensitive data, secure and convenient mobile authentication technologies are desired. Except for traditional passwords, many mobile devices have biometric-based authentication methods (e.g., fingerprint, voiceprint, and face recognition), but they are vulnerable to spoofing attacks. To solve this problem, we study new biometric features which are based on the dental occlusion and find that the bone-conducted sound of dental occlusion collected in binaural canals contains unique features of individual bones and teeth. Motivated by this, we propose a novel authentication system, TeethPass+, which uses earbuds to collect occlusal sounds in binaural canals to achieve authentication. First, we design an event detection method based on spectrum variance to detect bone-conducted sounds. Then, we analyze the time-frequency domain of the sounds to filter out motion noises and extract unique features of users from four aspects: teeth structure, bone structure, occlusal location, and occlusal sound. Finally, we train a Triplet network to construct the user template, which is used to complete authentication. Through extensive experiments including 53 volunteers, the performance of TeethPass+ in different environments is verified. TeethPass+ achieves an accuracy of 98.6% and resists 99.7% of spoofing attacks.

Subject: Sound

Publish: 2025-04-01 05:38:41 UTC


#4 Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks [PDF1] [Copy] [Kimi1] [REL]

Authors: Yongyi Zang, Sean O'Brien, Taylor Berg-Kirkpatrick, Julian McAuley, Zachary Novack

Large Audio Language Models (LALMs), where pretrained text LLMs are finetuned with audio input, have made remarkable progress in music understanding. However, current evaluation methodologies exhibit critical limitations: on the leading Music Question Answering benchmark, MuchoMusic, text-only LLMs without audio perception capabilities achieve surprisingly high accuracy of up to 56.4%, on par or above most LALMs. Furthermore, when presented with random Gaussian noise instead of actual audio, LALMs still perform significantly above chance. These findings suggest existing benchmarks predominantly assess reasoning abilities rather than audio perception. To overcome this challenge, we present RUListening: Robust Understanding through Listening, a framework that enhances perceptual evaluation in Music-QA benchmarks. We introduce the Perceptual Index (PI), a quantitative metric that measures a question's reliance on audio perception by analyzing log probability distributions from text-only language models. Using this metric, we generate synthetic, challenging distractors to create QA pairs that necessitate genuine audio perception. When applied to MuchoMusic, our filtered dataset successfully forces models to rely on perceptual information-text-only LLMs perform at chance levels, while LALMs similarly deteriorate when audio inputs are replaced with noise. These results validate our framework's effectiveness in creating benchmarks that more accurately evaluate audio perception capabilities.

Subject: Sound

Publish: 2025-04-01 02:34:19 UTC


#5 Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems [PDF] [Copy] [Kimi1] [REL]

Authors: Weifei Jin, Yuxin Cao, Junjie Su, Derui Wang, Yedi Zhang, Minhui Xue, Jie Hao, Jin Song Dong, Yixian Yang

The widespread application of automatic speech recognition (ASR) supports large-scale voice surveillance, raising concerns about privacy among users. In this paper, we concentrate on using adversarial examples to mitigate unauthorized disclosure of speech privacy thwarted by potential eavesdroppers in speech communications. While audio adversarial examples have demonstrated the capability to mislead ASR models or evade ASR surveillance, they are typically constructed through time-intensive offline optimization, restricting their practicality in real-time voice communication. Recent work overcame this limitation by generating universal adversarial perturbations (UAPs) and enhancing their transferability for black-box scenarios. However, they introduced excessive noise that significantly degrades audio quality and affects human perception, thereby limiting their effectiveness in practical scenarios. To address this limitation and protect live users' speech against ASR systems, we propose a novel framework, AudioShield. Central to this framework is the concept of Transferable Universal Adversarial Perturbations in the Latent Space (LS-TUAP). By transferring the perturbations to the latent space, the audio quality is preserved to a large extent. Additionally, we propose target feature adaptation to enhance the transferability of UAPs by embedding target text features into the perturbations. Comprehensive evaluation on four commercial ASR APIs (Google, Amazon, iFlytek, and Alibaba), three voice assistants, two LLM-powered ASR and one NN-based ASR demonstrates the protection superiority of AudioShield over existing competitors, and both objective and subjective evaluations indicate that AudioShield significantly improves the audio quality. Moreover, AudioShield also shows high effectiveness in real-time end-to-end scenarios, and demonstrates strong resilience against adaptive countermeasures.

Subjects: Cryptography and Security , Machine Learning , Sound

Publish: 2025-04-01 14:49:39 UTC