Towards Aligning Multimodal LLMs with Human Experts: A Focus on Parent-Child Interaction

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the capability boundaries and human-AI alignment pathways of multimodal large language models (MLLMs) in understanding joint attention—a foundational social behavior in parent–child interactions. Addressing the lack of interpretability and expert-in-the-loop mechanisms in current MLLM-based social cognition research, we propose a two-stage prompting strategy: Stage 1 elicits fine-grained behavioral observations (e.g., gaze, gesture, vocalization); Stage 2 prompts social inference based on those observations. Through video annotation, in-depth interviews, and inter-rater agreement assessment with speech-language pathologists, we find that MLLMs achieve high observational alignment with experts (Cohen's κ > 0.75), yet exhibit limited judgment-level alignment due to heterogeneity in expert interpretive standards. This work is the first to systematically characterize the "strong observation, weak inference consensus" duality of MLLMs in social behavior analysis, providing both theoretical foundations and methodological paradigms for developing trustworthy human-AI collaborative assessment frameworks.
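The summary reports observational alignment via Cohen's κ, which corrects observed rater agreement for the agreement expected by chance. As a minimal illustration (not the paper's code), the statistic for two raters over nominal labels can be computed as:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (nominal labels)."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in set(ca) | set(cb))
    return (p_o - p_e) / (1 - p_e)
```

A κ above 0.75, as reported for the observation layer, is conventionally read as substantial-to-excellent agreement; values near 0 indicate agreement no better than chance.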

📝 Abstract
While multimodal large language models (MLLMs) are increasingly applied in human-centred AI systems, their ability to understand complex social interactions remains uncertain. We present an exploratory study on aligning MLLMs with speech-language pathologists (SLPs) in analysing joint attention in parent-child interactions, a key construct in early social-communicative development. Drawing on interviews and video annotations with three SLPs, we characterise how observational cues of gaze, action, and vocalisation inform their reasoning processes. We then test whether an MLLM can approximate this workflow through a two-stage prompting strategy that separates observation from judgement. Our findings reveal that alignment is more robust at the observation layer, where experts share common descriptors, than at the judgement layer, where interpretive criteria diverge. We position this work as a case-based probe into expert-AI alignment in complex social behaviour, highlighting both the feasibility and the challenges of applying MLLMs to socially situated interaction analysis.
Problem

Research questions and friction points this paper is trying to address.

Aligning multimodal LLMs with speech-language pathologists' analysis
Evaluating AI understanding of joint attention in parent-child interactions
Testing expert-AI alignment in observation versus judgment layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage prompting separates observation from judgment
Aligning MLLMs with expert annotations of social interactions
Testing multimodal LLMs on gaze, action, and vocalization cues
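The two-stage separation above can be sketched as a small pipeline. The prompt wording and the `ask` callable are hypothetical stand-ins (the paper does not publish its prompts or API); the point is only the structure: stage 2 sees the stage-1 observations, not the video, so description and inference stay decoupled.

```python
# Hypothetical sketch of two-stage prompting; `ask(prompt)` stands in for
# any MLLM call (e.g. one that has already been given the video frames).

STAGE1_PROMPT = (
    "Describe the observable behaviours in this parent-child interaction "
    "clip: gaze direction, gestures, and vocalisations. Do not interpret."
)

STAGE2_TEMPLATE = (
    "Based only on these observations:\n{observations}\n"
    "Judge whether an episode of joint attention occurred, and explain why."
)

def two_stage_analysis(ask):
    """Stage 1 elicits observations; stage 2 infers from those observations only."""
    observations = ask(STAGE1_PROMPT)
    judgment = ask(STAGE2_TEMPLATE.format(observations=observations))
    return {"observations": observations, "judgment": judgment}
```

Keeping the stages as separate calls is what lets experts audit the observation layer (where the paper finds strong alignment) independently of the judgement layer (where it finds divergence).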