2507.04094


MMMOS: Multi-domain Multi-axis Audio Quality Assessment

Authors: Yi-Cheng Lin, Jia-Hung Chen, Hung-yi Lee

Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS achieves a 20-30% reduction in mean squared error and a 4-5% increase in Kendall's τ relative to the baseline, places first in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.
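The two evaluation metrics named in the abstract, mean squared error and Kendall's τ, can be illustrated with a minimal sketch. The abstract does not say how the eight models are combined, so the plain averaging in `ensemble_mean` is an assumption, and all function names and scores below are hypothetical placeholders rather than the authors' code or data. The τ variant shown is τ-b, which corrects for ties.

```python
from itertools import combinations
from math import sqrt


def ensemble_mean(model_preds):
    """Average per-clip predictions across models.

    Assumption: simple unweighted averaging; the abstract does not
    specify the ensembling rule.
    """
    n_models = len(model_preds)
    return [sum(col) / n_models for col in zip(*model_preds)]


def mse(pred, target):
    """Mean squared error between predicted and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation, computed over all pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from the denominator
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + tied_x_only)
        * (concordant + discordant + tied_y_only)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical scores for one axis (e.g. Production Quality) on four clips.
human_scores = [7.2, 5.1, 8.4, 6.0]
model_a = [6.9, 5.5, 8.0, 6.4]
model_b = [7.5, 4.9, 8.6, 5.8]

ensemble = ensemble_mean([model_a, model_b])
ensemble_mse = mse(ensemble, human_scores)
ensemble_tau = kendall_tau_b(ensemble, human_scores)
```

Lower MSE means the predicted scores are closer to the human ratings in absolute terms, while a higher τ means the predicted ranking of clips agrees more closely with the human ranking; the two can move independently, which is presumably why the abstract reports both.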

Subjects: Audio and Speech Processing, Artificial Intelligence, Computation and Language

Publish: 2025-07-05 16:42:09 UTC