🤖 AI Summary
This study investigates the capability boundaries and human-AI alignment pathways of multimodal large language models (MLLMs) in understanding joint attention—a foundational social behavior in parent–child interactions. Addressing the lack of interpretability and of expert-in-the-loop mechanisms in current MLLM-based social-cognition analysis, we propose a two-stage prompting strategy: Stage 1 elicits fine-grained behavioral observations (e.g., gaze, gesture, vocalization); Stage 2 prompts social inference based on those observations. Through video annotation, in-depth interviews, and inter-rater agreement assessment with speech-language pathologists, we find that MLLMs achieve high observational alignment with experts (Cohen’s κ > 0.75), yet exhibit limited judgment-level alignment due to heterogeneity in expert interpretive standards. This work is the first to systematically characterize the “strong observation, weak inference consensus” duality of MLLMs in social behavior analysis, providing both theoretical foundations and methodological paradigms for developing trustworthy human-AI collaborative assessment frameworks.
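The two-stage separation described above can be sketched as a pair of prompt templates, where the second stage is conditioned only on the first stage's behavioral observations. The prompt wording below is hypothetical, for illustration; the study's exact prompts are not reproduced here.

```python
# Illustrative sketch of two-stage prompting that separates observation
# from judgment. All prompt text is a hypothetical stand-in, not the
# study's actual prompts.

STAGE1_OBSERVATION_PROMPT = (
    "Watch the parent-child interaction clip and describe, without "
    "interpretation, each observable behavior: gaze direction, "
    "gestures, and vocalizations."
)

STAGE2_JUDGMENT_TEMPLATE = (
    "Based only on the following behavioral observations, judge whether "
    "an episode of joint attention occurred, and explain your reasoning.\n\n"
    "Observations:\n{observations}"
)

def build_two_stage_prompts(observations: str) -> tuple[str, str]:
    """Return (stage 1, stage 2) prompts; stage 2 is grounded solely
    in the stage 1 observation output passed in as `observations`."""
    return (
        STAGE1_OBSERVATION_PROMPT,
        STAGE2_JUDGMENT_TEMPLATE.format(observations=observations),
    )
```

The design point is that the judgment prompt never sees the raw video, only the stage-1 descriptions, which is what makes the observation layer separately auditable by experts.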
📝 Abstract
While multimodal large language models (MLLMs) are increasingly applied in human-centred AI systems, their ability to understand complex social interactions remains uncertain. We present an exploratory study on aligning MLLMs with speech-language pathologists (SLPs) in analysing joint attention in parent-child interactions, a key construct in early social-communicative development. Drawing on interviews and video annotations with three SLPs, we characterise how observational cues of gaze, action, and vocalisation inform their reasoning processes. We then test whether an MLLM can approximate this workflow through a two-stage prompting strategy that separates observation from judgement. Our findings reveal that alignment is more robust at the observation layer, where experts share common descriptors, than at the judgement layer, where interpretive criteria diverge. We position this work as a case-based probe into expert-AI alignment in complex social behaviour, highlighting both the feasibility and the challenges of applying MLLMs to socially situated interaction analysis.
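The agreement statistic quoted in the summary is Cohen's κ, which corrects raw agreement between two raters for the agreement expected by chance. As a reference, a minimal stdlib-only implementation (illustrative only, not the study's analysis code):

```python
from collections import Counter

def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the chance agreement implied by each
    rater's marginal label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's label distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

Under this metric, codes assigned per video segment by an expert and by the model would yield κ, with values above 0.75 (the threshold cited in the summary) conventionally read as substantial agreement.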