🤖 AI Summary
To address high user-perceived latency (UPL) in spoken dialogue systems, this paper proposes a response prefetching mechanism that jointly models semantic similarity and language-model token-level confidence. The core contribution is the Prediction Confidence Model (PCM), which dynamically triggers response generation by assessing, in real time, the semantic similarity between predictions made from the partial speech stream and the eventual complete utterance, integrated with token-level confidence scores from the language model. This enables semantically reliable prediction and precomputation of responses *before* the user finishes speaking, while avoiding wasteful speculative prefetching. Experiments demonstrate that the PCM significantly improves prefetching accuracy, reduces redundant computation, lowers average UPL by 23.6%, and decreases response first-token latency by 19.4%, all while preserving ASR and NLU accuracy—thereby improving end-to-end interaction timeliness.
📝 Abstract
Prefetching of dialogue responses has been investigated as a way to reduce user-perceived latency (UPL)—the user's waiting time before receiving the system's response—in spoken dialogue systems. To reduce UPL, the complete user utterance must be predicted before the end of the user's speech, typically with a language model, so that a dialogue response can be prepared in advance. In this study, we propose a prediction confidence model (PCM) that determines whether prefetching is feasible by estimating the semantic similarity between the predicted complete user utterance and the actual complete user utterance. We evaluated the PCM based on the differences between predicted and actual complete user utterances.
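The paper does not spell out the PCM's internals, but the described gating decision—trigger prefetching only when the predicted completion looks semantically reliable and the language model is confident—can be sketched as follows. This is a minimal illustration, not the authors' implementation: the similarity score is assumed to come from some estimator (e.g. an embedding-based model), the function names and thresholds are hypothetical, and token confidence is summarized here as the geometric mean of per-token probabilities.

```python
import math

def mean_token_confidence(token_logprobs):
    # Geometric mean of per-token probabilities, computed in log space.
    # A single low-probability token pulls this down sharply.
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def should_prefetch(estimated_similarity, token_logprobs,
                    sim_threshold=0.9, conf_threshold=0.6):
    """Hypothetical PCM-style gate.

    estimated_similarity: the PCM's estimate (in [0, 1]) of how close the
        predicted complete utterance is to the eventual complete utterance.
    token_logprobs: log-probabilities the language model assigned to each
        token of its predicted completion.
    Thresholds are illustrative placeholders, not values from the paper.
    """
    return (estimated_similarity >= sim_threshold
            and mean_token_confidence(token_logprobs) >= conf_threshold)

# A confident, semantically close prediction triggers prefetching;
# a dissimilar or low-confidence one does not.
confident = [math.log(0.9)] * 3   # geometric mean 0.9
uncertain = [math.log(0.3)] * 3   # geometric mean 0.3
print(should_prefetch(0.95, confident))  # True
print(should_prefetch(0.95, uncertain))  # False
print(should_prefetch(0.50, confident))  # False
```

In a real system, both signals would be recomputed as each partial ASR hypothesis arrives, so a prediction that fails the gate early can still trigger prefetching later in the utterance.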