ashihara25@interspeech_2025@ISCA

#1 Analysis of Semantic and Acoustic Token Variability Across Speech, Music, and Audio Domains

Authors: Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Kohei Matsuura, Shota Horiguchi

Discrete audio representation techniques, which convert an audio signal into a sequence of audio tokens using neural audio codecs or self-supervised speech models, have gained attention because they make it possible to model audio efficiently with large language models (LMs). While these audio tokens have been studied in various domains (e.g., speech, music, and general sound), their encoding properties across domains remain unclear. This paper examines several audio token types to analyze cross-domain variations. Our main finding is that, regardless of domain, audio tokens exhibit consistent statistical structure and probabilistic predictability, as inferred from their rank-frequency distributions and perplexity. However, the token usage pattern is somewhat domain-dependent. This result underpins the steady success of versatile audio LMs, while also suggesting that domain-aware LMs could further improve performance by better capturing domain-specific token usage distributions.
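
To make the three quantities named in the abstract concrete, the following is a minimal sketch of how a rank-frequency distribution, a perplexity, and a cross-domain token usage comparison could be computed from integer token sequences. It is not the authors' method: the add-alpha unigram model, the use of Jensen-Shannon divergence for comparing usage, and all function names (`rank_frequency`, `unigram_perplexity`, `js_divergence`) are assumptions chosen for illustration only.

```python
from collections import Counter
import math


def rank_frequency(tokens):
    """Return token counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def unigram_perplexity(train_tokens, test_tokens, vocab_size, alpha=1.0):
    """Perplexity of test_tokens under an add-alpha smoothed unigram model.

    Assumption: a unigram model stands in for whatever LM is actually used.
    """
    counts = Counter(train_tokens)
    denom = len(train_tokens) + alpha * vocab_size
    log_prob = sum(math.log((counts[t] + alpha) / denom) for t in test_tokens)
    return math.exp(-log_prob / len(test_tokens))


def js_divergence(tokens_a, tokens_b):
    """Jensen-Shannon divergence (in bits) between two token usage distributions.

    Assumption: JS divergence is one possible measure of how differently
    two domains use the same token vocabulary; 0 means identical usage.
    """
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    divergence = 0.0
    for token in set(count_a) | set(count_b):
        p = count_a[token] / n_a
        q = count_b[token] / n_b
        m = 0.5 * (p + q)
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


# Hypothetical toy token sequences standing in for two domains.
speech_tokens = [0, 0, 0, 1, 1, 2]
music_tokens = [2, 2, 2, 3, 3, 0]

print(rank_frequency(speech_tokens))
print(unigram_perplexity(speech_tokens, speech_tokens, vocab_size=4))
print(js_divergence(speech_tokens, music_tokens))
```

In this toy setup the two sequences have the same rank-frequency shape but assign the counts to different tokens, which mirrors the distinction the abstract draws between consistent statistical structure and domain-dependent token usage.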

Subject: INTERSPEECH.2025 - Modelling and Learning