From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines

Authors: Titaya Mairittha, Tanakon Sawanglok, Panuwit Raden, Jirapast Buntub, Thanapat Warunee, Napat Asawachaisuvikrom, Thanaphum Saiwongin

While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real time. Through system-level analysis, we demonstrate that these friction points should be understood not as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them.

Subjects: Human-Computer Interaction, Artificial Intelligence, Computation and Language, Software Engineering

Published: 2025-12-12 17:05:11 UTC

arXiv: 2512.11724
