🤖 AI Summary
This study addresses the challenge of evaluating end-to-end performance in AI interview systems built from cascaded speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) components. Leveraging over 300,000 real-world interviews, we propose an automated, multimodal dialogue evaluation framework tailored to voice-based AI interviews. Methodologically, we introduce an LLM-as-a-Judge paradigm to uniformly quantify conversational quality, technical accuracy, and skill-assessment capability. Experimental results show that the Google STT + GPT-4.1 pipeline significantly outperforms alternatives across multiple objective metrics; however, objective scores correlate only weakly with user satisfaction, highlighting the critical role of non-technical factors such as interaction naturalness and feedback latency. Our contributions are: (1) empirical validation of optimization pathways for cascaded architectures; (2) a human-centered evaluation methodology that weighs technical robustness alongside experiential design; and (3) a reusable, open framework for multimodal conversational AI assessment.
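As a concrete illustration of the LLM-as-a-Judge paradigm, the sketch below shows one way a rubric-based judge could be wired up. It is a minimal sketch, not the paper's actual implementation: the `judge_model` callable, the prompt wording, and the 1–5 JSON schema are assumptions for illustration; only the three scoring dimensions come from the study.

```python
import json
from typing import Callable

# Rubric dimensions mirroring the three axes evaluated in the study.
RUBRIC = ["conversational_quality", "technical_accuracy", "skill_assessment"]

# Hypothetical judge prompt; the paper's actual rubric wording is not shown here.
JUDGE_PROMPT = """You are grading an AI-conducted job interview transcript.
Score each dimension from 1 (poor) to 5 (excellent) and reply with JSON only:
{{"conversational_quality": int, "technical_accuracy": int, "skill_assessment": int}}

Transcript:
{transcript}
"""

def judge_interview(transcript: str, judge_model: Callable[[str], str]) -> dict:
    """Score one interview transcript with an LLM judge.

    `judge_model` is any function that takes a prompt string and returns the
    model's text completion (e.g. a thin wrapper around a chat API).
    """
    raw = judge_model(JUDGE_PROMPT.format(transcript=transcript))
    scores = json.loads(raw)
    # Clamp to the rubric range to guard against out-of-range judge output.
    return {dim: min(5, max(1, int(scores[dim]))) for dim in RUBRIC}
```

Keeping the judge behind a plain callable makes the scorer backend-agnostic, so the same rubric can grade transcripts produced by any of the STT × LLM × TTS stacks under comparison.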
📝 Abstract
Voice-based conversational AI systems increasingly rely on cascaded architectures combining speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. However, systematic evaluation of different component combinations in production settings remains understudied. We present a large-scale empirical comparison of STT × LLM × TTS stacks using data from over 300,000 AI-conducted job interviews. We develop an automated evaluation framework using LLM-as-a-Judge to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of four production configurations reveals that Google STT paired with GPT-4.1 significantly outperforms alternatives in both conversational and technical quality metrics. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversational AI systems and contribute a validated evaluation methodology for voice-based interactions.
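The weak metric–satisfaction link can be probed with a simple rank correlation. The sketch below is purely illustrative: the per-interview arrays are made-up placeholder values standing in for judge scores and post-interview satisfaction ratings, not data from the study.

```python
from scipy.stats import spearmanr

# Hypothetical per-interview data: an LLM-judge quality score (1-5)
# and the user's post-interview satisfaction rating (1-5).
judge_scores = [4.2, 3.8, 4.5, 2.9, 4.0, 3.1, 4.7, 3.5]
satisfaction = [3.0, 4.5, 2.5, 4.0, 3.5, 4.0, 3.0, 4.5]

# Spearman's rho is robust to monotone rescaling of either rating scale,
# which matters when judge scores and CSAT use different instruments.
rho, p_value = spearmanr(judge_scores, satisfaction)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A |rho| near zero, as the study reports for its production data, means
# objective quality explains little of the variance in user satisfaction.
```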