🤖 AI Summary
This study addresses the lack of fine-grained modeling of the relationship between the lexical-prosodic forms of backchannels (e.g., “yeah,” “mhm”) and their pragmatic meanings. The authors propose a two-stage framework: first, a large language model is fine-tuned on dialogue transcripts to obtain contextual representations; then, contrastive learning is used to construct a joint embedding space that aligns these contextual representations with backchannel forms. This work presents the first integration of large language models and contrastive learning for this task, incorporating WavLM audio features and human-elicited triplet similarity judgments. The resulting embeddings substantially outperform baseline methods on context–backchannel matching and align more closely with human assessments of backchannel appropriateness than raw WavLM features do.
📝 Abstract
Backchannels (e.g., “yeah,” “mhm,” and “right”) are short, non-interruptive feedback signals whose lexical form and prosody jointly convey pragmatic meaning. While prior computational research has largely focused on predicting backchannel timing, the relationship between lexico-prosodic form and meaning remains underexplored. We propose a two-stage framework: first, fine-tuning large language models on dialogue transcripts to derive rich contextual representations; and second, learning a joint embedding space for dialogue contexts and backchannel realizations. We evaluate alignment with human perception via triadic similarity judgments (prosodic and cross-lexical) and a context-backchannel suitability task. Our results demonstrate that the learned projections substantially improve context-backchannel retrieval compared to previous methods. In addition, they reveal that backchannel form is highly sensitive to extended conversational context and that the learned embeddings align more closely with human judgments than raw WavLM features.
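The second stage, a joint embedding space for dialogue contexts and backchannel realizations, can be illustrated with a minimal sketch. The abstract does not state the training objective, so the symmetric InfoNCE loss below is an assumption, as are the function names, the temperature value, and the random vectors standing in for projected context representations and backchannel audio features. The sketch shows only how paired embeddings could be scored and how retrieval quality could be measured; projection heads, the fine-tuned language model, and the training loop are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def info_nce_loss(context_emb, backchannel_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (assumed objective).

    Row i of context_emb is treated as the positive match for row i of
    backchannel_emb; every other row in the batch acts as a negative.
    """
    c = l2_normalize(context_emb)
    b = l2_normalize(backchannel_emb)
    logits = c @ b.T / temperature  # shape (N, N)

    def diagonal_cross_entropy(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the context-to-backchannel and backchannel-to-context directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def recall_at_k(context_emb, backchannel_emb, k=1):
    """Fraction of contexts whose paired backchannel ranks in the top k by cosine similarity."""
    sims = l2_normalize(context_emb) @ l2_normalize(backchannel_emb).T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(sims.shape[0])[:, None]
    return float(np.mean((top_k == targets).any(axis=1)))


# Hypothetical stand-ins: random "context" vectors and near-copies as "backchannels".
rng = np.random.default_rng(0)
contexts = rng.normal(size=(8, 16))
aligned_backchannels = contexts + 0.01 * rng.normal(size=(8, 16))
mismatched_backchannels = aligned_backchannels[::-1].copy()  # pairs deliberately broken

loss_aligned = info_nce_loss(contexts, aligned_backchannels)
loss_mismatched = info_nce_loss(contexts, mismatched_backchannels)
retrieval_recall = recall_at_k(contexts, aligned_backchannels, k=1)
print(loss_aligned, loss_mismatched, retrieval_recall)
```

In this toy setup the loss is lower when each context sits next to its own backchannel than when the pairing is reversed, which is the property a retrieval-oriented embedding space would be trained to have.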