Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems

Authors: Rumi Allbert, Nima Yazdani, Ali Ansari, Aruj Mahajan, Amirhossein Afsharrad, Seyed Shahabeddin Mousavi

Voice-based conversational AI systems increasingly rely on cascaded architectures that combine speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data sampled from over 300,000 AI-conducted job interviews. We use an automated LLM-as-a-Judge evaluation framework to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of five production configurations reveals that a stack combining Google's STT, GPT-4.1, and Cartesia's TTS outperforms alternatives in both objective quality metrics and user satisfaction scores. Surprisingly, we find that objective quality metrics correlate only weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversations and contribute a validated evaluation methodology for human-AI interactions.

Subjects: Audio and Speech Processing, Computation and Language

Published: 2025-07-15 22:30:55 UTC

2507.16835
