🤖 AI Summary
This study addresses the challenge of evaluating end-to-end performance in AI interview systems built from cascaded speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) components. Leveraging over 300,000 real-world interviews, we propose an automated, multimodal dialogue evaluation framework tailored to voice-based AI interviews. Methodologically, we introduce an LLM-as-a-Judge paradigm to uniformly quantify conversational quality, technical accuracy, and skill-assessment capability. Experimental results show that the Google STT + GPT-4.1 pipeline significantly outperforms alternatives across multiple objective metrics; however, objective scores correlate only weakly with user satisfaction, highlighting the critical role of non-technical factors such as interaction naturalness and feedback latency. Our contributions are: (1) empirical validation of optimization pathways for cascaded architectures; (2) a human-centered evaluation methodology that weighs technical robustness alongside experiential design; and (3) a reusable, open framework for multimodal conversational AI assessment.
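As a concrete illustration of the LLM-as-a-Judge paradigm, the sketch below shows one way a rubric-based judge could be wired up. It is a minimal sketch, not the paper's actual implementation: the `judge_model` callable, the prompt wording, and the 1–5 JSON schema are assumptions for illustration; only the three scoring dimensions come from the study.

```python
import json
from typing import Callable

# Rubric dimensions mirroring the three axes evaluated in the study.
RUBRIC = ["conversational_quality", "technical_accuracy", "skill_assessment"]

# Hypothetical judge prompt; the paper's actual rubric wording is not shown here.
JUDGE_PROMPT = """You are grading an AI-conducted job interview transcript.
Score each dimension from 1 (poor) to 5 (excellent) and reply with JSON only:
{{"conversational_quality": int, "technical_accuracy": int, "skill_assessment": int}}

Transcript:
{transcript}
"""

def judge_interview(transcript: str, judge_model: Callable[[str], str]) -> dict:
    """Score one interview transcript with an LLM judge.

    `judge_model` is any function that takes a prompt string and returns the
    model's text completion (e.g. a thin wrapper around a chat API).
    """
    raw = judge_model(JUDGE_PROMPT.format(transcript=transcript))
    scores = json.loads(raw)
    # Clamp to the rubric range to guard against out-of-range judge output.
    return {dim: min(5, max(1, int(scores[dim]))) for dim in RUBRIC}
```

Keeping the judge behind a plain callable makes the scorer backend-agnostic, so the same rubric can grade transcripts produced by any of the STT × LLM × TTS stacks under comparison.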
📝 Abstract
Voice-based conversational AI systems increasingly rely on cascaded architectures combining speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. However, systematic evaluation of different component combinations in production settings remains understudied. We present a large-scale empirical comparison of STT × LLM × TTS stacks using data from over 300,000 AI-conducted job interviews. We develop an automated evaluation framework using LLM-as-a-Judge to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of four production configurations reveals that Google STT paired with GPT-4.1 significantly outperforms alternatives in both conversational and technical quality metrics. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversational AI systems and contribute a validated evaluation methodology for voice-based interactions.
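The weak metric–satisfaction link can be probed with a simple rank correlation. The sketch below is purely illustrative: the per-interview arrays are made-up placeholder values standing in for judge scores and post-interview satisfaction ratings, not data from the study.

```python
from scipy.stats import spearmanr

# Hypothetical per-interview data: an LLM-judge quality score (1-5)
# and the user's post-interview satisfaction rating (1-5).
judge_scores = [4.2, 3.8, 4.5, 2.9, 4.0, 3.1, 4.7, 3.5]
satisfaction = [3.0, 4.5, 2.5, 4.0, 3.5, 4.0, 3.0, 4.5]

# Spearman's rho is robust to monotone rescaling of either rating scale,
# which matters when judge scores and CSAT use different instruments.
rho, p_value = spearmanr(judge_scores, satisfaction)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A |rho| near zero, as the study reports for its production data, means
# objective quality explains little of the variance in user satisfaction.
```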