DRES: Benchmarking LLMs for Disfluency Removal

Authors: Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, James Caverlee

Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.
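To make the precision/recall framing and the over-deletion finding (ii) concrete, here is a minimal sketch of token-level scoring for text-level disfluency removal. It is an illustration under stated assumptions, not the DRES scorer: the function names `deleted_indices` and `score` are hypothetical, the example sentence is invented, and aligning the output to the source with `difflib.SequenceMatcher` is one simple choice among several possible alignment methods.

```python
from difflib import SequenceMatcher


def deleted_indices(source, output):
    """Return indices of source tokens that are not aligned to any output token.

    Alignment uses difflib's matching blocks (an assumption for this sketch).
    """
    matcher = SequenceMatcher(a=source, b=output, autojunk=False)
    kept = set()
    for block in matcher.get_matching_blocks():
        kept.update(range(block.a, block.a + block.size))
    return {i for i in range(len(source)) if i not in kept}


def score(source, gold_fluent, output):
    """Precision, recall, and F1 over deleted tokens.

    A deletion counts as correct if the gold fluent transcript also drops
    that source token.
    """
    gold = deleted_indices(source, gold_fluent)
    pred = deleted_indices(source, output)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the model removes every disfluency but also
# drops the fluent tokens "to" and "please" (over-deletion).
source = "uh book a flight to boston i mean denver please".split()
gold_fluent = "book a flight to denver please".split()
model_output = "book a flight denver".split()

precision, recall, f1 = score(source, gold_fluent, model_output)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this example recall is perfect because every disfluent token is removed, while precision drops because two fluent tokens are deleted as well. Sequences with repeated tokens can align ambiguously under this method, so a real evaluation would likely rely on the gold token-level annotations directly.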

Subjects: Computation and Language, Artificial Intelligence, Audio and Speech Processing

Publish: 2025-09-24 17:08:12 UTC

arXiv: 2509.20321
