Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition

Authors: Wonjun Lee, Hyounghun Kim, Gary Geunbae Lee

Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce MoE-CTC, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, which gradually transitions to label-free routing for inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the MCV-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.
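The transition from accent-aware to label-free routing described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the linear schedule, the one-expert-per-accent mapping, and the names `blended_routing` and `routed_loss` are hypothetical, and the abstract does not specify how the per-expert CTC losses are combined.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blended_routing(router_logits, accent_id, step, total_steps):
    """Blend label-based and learned routing weights.

    ASSUMPTION: a linear schedule where alpha goes from 0 (pure accent-label
    routing) to 1 (pure learned, label-free routing). The paper's actual
    schedule is not given in the abstract.
    ASSUMPTION: accent_id indexes the expert assigned to that accent.
    """
    alpha = min(max(step / total_steps, 0.0), 1.0)
    learned = softmax(router_logits)
    if accent_id is None:
        # No accent label available (e.g. at inference): use the router only.
        return learned
    label = [1.0 if i == accent_id else 0.0 for i in range(len(router_logits))]
    return [(1.0 - alpha) * l + alpha * p for l, p in zip(label, learned)]


def routed_loss(expert_ctc_losses, weights):
    """Weight each expert's CTC loss by its routing weight (hypothetical form)."""
    return sum(w * loss for w, loss in zip(weights, expert_ctc_losses))


# Early in training the accent label decides the routing entirely.
early = blended_routing([0.2, -0.1, 0.4], accent_id=1, step=0, total_steps=1000)
# At inference no label is passed, so routing comes from the router alone.
inference = blended_routing([0.2, -0.1, 0.4], accent_id=None, step=1000, total_steps=1000)
```

Under these assumptions, the per-expert CTC losses are placeholders for values a real CTC implementation would produce; the sketch only shows how routing weights could shift from label-driven to learned over training.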

Subjects: Computation and Language, Artificial Intelligence

Published: 2026-02-02 11:16:34 UTC

2602.01967
