🤖 AI Summary
This work addresses the high cost of manual coding in Motivational Interviewing (MI) by proposing an automated alternative. It introduces multimodal self-consistency reasoning for MI coding, using audio-language models (ALMs) to integrate verbal content and acoustic prosody from raw audio. Four prompting strategies (analytic, prosody-aware, evidence-scoring, and comparative) are each sampled stochastically three times, producing 12 reasoning trajectories per utterance, and the final code is chosen by majority vote across them. Evaluated on five de-identified recorded MI sessions, the method achieves 52.56% accuracy and a macro-F1 score of 46.40%, exceeding single-pass baseline prompting. Ablation experiments that remove individual modules consistently degrade performance on the primary metrics, which supports the contribution of each component.
📝 Abstract
BACKGROUND: Coding Motivational Interviewing (MI) sessions is essential for understanding client behaviors and predicting outcomes, but it requires substantial time and labor from trained MI professionals. Recent advances in audio-language models (ALMs) offer new opportunities to automate MI coding by capturing multimodal behavioral signals.
OBJECTIVE: This study aims to develop an automatic MI coding approach based on ALMs that analyzes raw audio input and integrates predictions from multiple reasoning trajectories using self-consistency to improve coding robustness.
METHODS: We experimented with five recorded sessions from de-identified MI audio tapes. We deployed ALMs with four complementary prompts to support utterance-level reasoning: analytic prompting for verbal cues, prosody-aware prompting for acoustic cues, evidence-scoring prompting for quantitative hypothesis testing, and comparative prompting for contrastive reasoning. Three stochastic samples were drawn for each prompt, generating 12 independent reasoning trajectories per utterance. Final predictions were determined by majority voting across all trajectories.
RESULTS: Performance was evaluated using accuracy, precision, recall, and macro-F1 scores. The proposed multimodal self-consistency approach achieved 52.56% accuracy, 54.03% precision, 47.45% recall, and a macro-F1 score of 46.40%, exceeding baseline methods. Systematic ablation experiments that removed individual modules consistently degraded performance on the primary metrics.
CONCLUSIONS: Multimodal self-consistency outperforms single-pass baseline prompting approaches for MI coding. These findings suggest that incorporating both what clients say and how they say it can support more reliable automatic MI coding.
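The voting step described in METHODS (four prompts, three stochastic samples each, majority vote over the 12 resulting labels) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `stub_model` stand-in for an ALM call, the placeholder labels `change_talk` and `sustain_talk`, and the tie-breaking rule are all assumptions made for illustration, since the abstract specifies none of them.

```python
from collections import Counter
from typing import Callable, List, Tuple

# The four prompt styles named in the abstract; the identifiers are illustrative.
PROMPT_STYLES = ["analytic", "prosody_aware", "evidence_scoring", "comparative"]
SAMPLES_PER_PROMPT = 3  # three stochastic samples per prompt -> 12 trajectories


def self_consistent_code(
    utterance_audio: bytes,
    query_model: Callable[[bytes, str], str],
) -> Tuple[str, List[str]]:
    """Collect one label per trajectory and return the majority label."""
    votes = [
        query_model(utterance_audio, style)
        for style in PROMPT_STYLES
        for _ in range(SAMPLES_PER_PROMPT)
    ]
    # Assumption: ties go to the label that was seen first (Counter ordering).
    winner, _ = Counter(votes).most_common(1)[0]
    return winner, votes


def stub_model(utterance_audio: bytes, style: str) -> str:
    """Hypothetical stand-in for a sampled ALM call; labels are placeholders."""
    return "sustain_talk" if style == "prosody_aware" else "change_talk"


label, votes = self_consistent_code(b"", stub_model)
print(label, len(votes))  # change_talk 12
```

In a real pipeline `query_model` would wrap a sampled ALM request with a nonzero temperature, so repeated calls with the same prompt can disagree; the vote then smooths over that variation.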