Recently, deep learning solutions have been successfully applied to many signal-processing tasks, including acoustic echo cancellation (AEC). Most existing work focuses on architecture design and ignores practical issues such as the effect of frame length on the performance of end-to-end AEC models. In real-time applications, the frame length can be as short as 10 ms. Because the context observed during training is then very limited, the model produces boundary discontinuities (glitches) in the final output. Using longer frames or post-processing can help, but both add extra delay, which may be undesirable depending on the application. In this paper, we investigate the practical issue of handling short frames for AEC and propose an efficient remedy: by keeping the long-context information in each batch and using it during loss calculation, we compensate for the short frames. Our solution is model-agnostic and does not affect inference time.
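A minimal sketch of the idea described above, assuming a hypothetical frame-wise model (`toy_model` is a stand-in, not the paper's architecture): the model still processes short frames one at a time, but the training loss is computed over the re-assembled long context, so discontinuities at frame boundaries are penalized.

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping short frames."""
    n_frames = len(x) // frame_len
    return x[: n_frames * frame_len].reshape(n_frames, frame_len)

def toy_model(frame, state):
    """Hypothetical per-frame echo suppressor (here: simple scaling).
    Stands in for a real-time AEC model operating on one short frame."""
    return frame * 0.5, state

def context_aware_loss(frame_outputs, target):
    """Compute the loss over the concatenated long context instead of
    per-frame, so frame-boundary glitches contribute to the loss."""
    full = np.concatenate(frame_outputs)   # re-assemble the long context
    return np.mean((full - target[: len(full)]) ** 2)

# Usage: 30 ms of audio at 16 kHz, processed in 10 ms frames.
rng = np.random.default_rng(0)
sig = rng.standard_normal(480)
frames = frame_signal(sig, 160)
state, outs = None, []
for f in frames:
    y, state = toy_model(f, state)
    outs.append(y)
loss = context_aware_loss(outs, 0.5 * sig)  # target: ideally suppressed signal
```

Because the loss spans the whole batch context while inference remains strictly frame-by-frame, this kind of scheme adds no extra latency at run time.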