🤖 AI Summary
To address pedestrian trajectory prediction under resource-constrained conditions, this paper proposes a multimodal knowledge distillation framework in which a lightweight student model, accepting only trajectory or pose inputs, absorbs the motion patterns and social-interaction knowledge learned by a multimodal teacher model that integrates trajectory, human pose, and textual descriptions. The framework introduces the text modality into multimodal distillation for the first time, decouples individual motion modeling from group interaction modeling, and supports training with both vision-language model (VLM)-generated and human-annotated text. The method combines graph neural networks with behavioral modeling and eliminates the need for online VLM inference. Evaluated on the JRDB, SIT, and ETH/UCY benchmarks, the distilled student model improves average displacement error (ADE) and final displacement error (FDE) by up to 13%, significantly outperforming unimodal baselines while remaining feasible for edge deployment.
📝 Abstract
Pedestrian trajectory forecasting is crucial in applications such as autonomous driving and mobile robot navigation. In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) to enhance prediction accuracy. Indeed, we find that textual descriptions play a crucial role in integrating additional modalities into a unified understanding. However, online extraction of text requires a VLM, which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework: a student model with limited modalities is distilled from a teacher model trained with the full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model using only trajectory, or human pose as the sole supplement. In doing so, we separately distill the core locomotion insights from intra-agent multi-modality and from inter-agent interaction. Our generalizable framework is validated with two state-of-the-art models across three datasets in both ego-view (JRDB, SIT) and BEV-view (ETH/UCY) setups, utilizing both annotated and VLM-generated text captions. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations, improving by up to ~13%. The code is available at https://github.com/Jaewoo97/KDTF.
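The abstract describes distilling teacher knowledge into the student along two decoupled paths: intra-agent (individual locomotion) and inter-agent (social interaction). A minimal NumPy sketch of such a decoupled objective is shown below; the function names, loss weights, feature shapes, and the use of MSE feature matching plus an ADE task term are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between two feature tensors."""
    return float(np.mean((a - b) ** 2))

def distill_loss(pred, gt, s_motion, t_motion, s_inter, t_inter,
                 w_motion=1.0, w_inter=1.0):
    """Hypothetical decoupled distillation objective:
    task loss (ADE between predicted and ground-truth trajectories)
    plus separate feature-matching terms for the student's
    individual-motion and social-interaction embeddings against
    the (frozen) teacher's."""
    ade = float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
    return (ade
            + w_motion * mse(s_motion, t_motion)   # intra-agent path
            + w_inter * mse(s_inter, t_inter))     # inter-agent path

# Toy example: 2 agents, 12 future steps, 2D coordinates, 64-d features.
rng = np.random.default_rng(0)
pred = rng.normal(size=(2, 12, 2))
gt = rng.normal(size=(2, 12, 2))
s_m, t_m = rng.normal(size=(2, 64)), rng.normal(size=(2, 64))
s_i, t_i = rng.normal(size=(2, 64)), rng.normal(size=(2, 64))
loss = distill_loss(pred, gt, s_m, t_m, s_i, t_i)
```

In this sketch the teacher features (`t_motion`, `t_inter`) would come from the frozen trajectory+pose+text teacher, so the student never needs VLM inference at test time.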