Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech

Authors: Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schuldt, Saikat Chatterjee

Self-supervised learning (SSL) models such as Wav2Vec2, HuBERT, and WavLM are widely used in speech processing. These transformer-based models consist of multiple layers, each capturing a different level of representation. While prior studies have explored their layer-wise representations for efficiency and performance, speech quality assessment (SQA) models predominantly rely on last-layer features, leaving intermediate layers underexamined. In this work, we systematically evaluate different layers of multiple SSL models for predicting mean-opinion-score (MOS). Features from each layer are fed into a lightweight regression network to assess their effectiveness. Our experiments consistently show that early-layer features outperform or match those from the last layer, leading to significant improvements over conventional approaches and state-of-the-art MOS prediction models. These findings highlight the advantages of early-layer selection, which offers enhanced performance and reduced system complexity.
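
The layer-wise evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract mentions a lightweight regression network, whereas this sketch swaps in closed-form ridge regression to stay short, and it uses synthetic stand-in features rather than real SSL hidden states or real MOS labels. All function names (`mean_pool`, `fit_ridge`, `rank_layers`, `make_synthetic_data`), the array shapes, and the choice of Pearson correlation as the score are assumptions made for illustration.

```python
import numpy as np


def mean_pool(frames):
    """Average frame-level features over time.

    frames: (N, L, T, D) = utterances, layers, frames, feature dim.
    Returns utterance-level features of shape (N, L, D).
    """
    return frames.mean(axis=2)


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with a bias column (assumed stand-in
    for the lightweight regression network mentioned in the abstract)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def predict(weights, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ weights


def rank_layers(train_x, train_y, val_x, val_y, alpha=1.0):
    """Fit one regressor per layer and rank layers by validation
    Pearson correlation between predicted and target scores."""
    scores = []
    for layer in range(train_x.shape[1]):
        w = fit_ridge(train_x[:, layer], train_y, alpha)
        pred = predict(w, val_x[:, layer])
        corr = float(np.corrcoef(pred, val_y)[0, 1])
        scores.append((layer, corr))
    return sorted(scores, key=lambda s: s[1], reverse=True)


def make_synthetic_data(seed=0, n_train=200, n_val=100,
                        n_layers=4, n_frames=20, dim=8, informative_layer=2):
    """Synthetic stand-in data: only `informative_layer` carries signal.
    These are NOT real SSL features or real MOS labels."""
    rng = np.random.default_rng(seed)
    n = n_train + n_val
    frames = rng.normal(size=(n, n_layers, n_frames, dim))
    feats = mean_pool(frames)
    w_true = rng.normal(size=dim)
    scores = 3.0 + feats[:, informative_layer] @ w_true
    scores = scores + 0.05 * rng.normal(size=n)
    return feats[:n_train], scores[:n_train], feats[n_train:], scores[n_train:]


if __name__ == "__main__":
    train_x, train_y, val_x, val_y = make_synthetic_data()
    for layer, corr in rank_layers(train_x, train_y, val_x, val_y):
        print(f"layer {layer}: validation Pearson r = {corr:.3f}")
```

With real data, the synthetic arrays would presumably be replaced by per-layer hidden states extracted from a pretrained SSL model and by human-rated MOS labels. Selecting an early layer would also allow the layers above it to be dropped at inference time, which is one way to read the reduced system complexity the abstract refers to.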

Subjects: Audio and Speech Processing, Sound

Published: 2025-08-12 14:25:55 UTC

arXiv: 2508.08962
