🤖 AI Summary
To address the limitations of conventional data-driven approaches in cross-domain generalization, this paper introduces a modular multimodal framework for context-aware human behavior prediction, tailored to human-robot collaboration scenarios. Methodologically, it benchmarks pre-trained Multimodal Large Language Models (MLLMs) over vision-language inputs, In-Context Learning (ICL), and autoregressive decoding strategies, thereby avoiding the expense of full fine-tuning. The contributions include: (i) a systematic evaluation of pre-trained MLLMs for context-aware human behavior prediction; and (ii) an empirical characterization of how input variations, ICL, and autoregressive techniques affect prediction quality. The best-performing framework configuration reaches 92.8% semantic similarity and 66.1% exact-label accuracy in predicting human behaviors in the target frame.
📝 Abstract
Predicting human behavior in shared environments is crucial for safe and efficient human-robot interaction. Traditional data-driven methods for this task are trained on domain-specific datasets and tied to particular activity types and prediction horizons. In contrast, recent breakthroughs in Large Language Models (LLMs) promise open-ended cross-domain generalization: the ability to describe various human activities and make predictions in any context. In particular, Multimodal LLMs (MLLMs) can integrate information from various sources, achieving greater contextual awareness and improved scene understanding. The difficulty in applying general-purpose MLLMs directly to prediction stems from their limited capacity for processing long input sequences, their sensitivity to prompt design, and the expense of fine-tuning. In this paper, we present a systematic analysis of applying pre-trained MLLMs to context-aware human behavior prediction. To this end, we introduce a modular multimodal human activity prediction framework that allows us to benchmark various MLLMs, input variations, In-Context Learning (ICL), and autoregressive techniques. Our evaluation indicates that the best-performing framework configuration reaches 92.8% semantic similarity and 66.1% exact label accuracy in predicting human behaviors in the target frame.