Abstract
Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR outperforms trimodal models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR's latent space on zero-shot in-Scene Object Placement and Motion Captioning. Code and pre-trained models are available at github.com/colloroneluca/MonSTeR.
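The core idea of a unified latent space across motion, text, and scene can be illustrated with a minimal retrieval sketch: embed each modality, normalize, and rank gallery items by a combined cross-modal similarity. Note this is an illustrative assumption, not MonSTeR's actual architecture; the equal-weight sum of pairwise cosine similarities and all function names here are hypothetical stand-ins for the paper's learned cross-modal representations and scoring.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project an embedding onto the unit sphere so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def trimodal_score(motion, text, scene):
    # Toy stand-in for a tri-modal alignment score: the equal-weight sum of the
    # three pairwise cosine similarities (MonSTeR's real scoring is learned).
    m, t, s = l2_normalize(motion), l2_normalize(text), l2_normalize(scene)
    return float(m @ t + m @ s + t @ s)

def retrieve_scene(query_motion, query_text, scene_gallery):
    # Rank candidate scene embeddings by how well each fits the (motion, text) query.
    scores = [trimodal_score(query_motion, query_text, s) for s in scene_gallery]
    return int(np.argmax(scores)), scores

# Usage with toy 4-D embeddings: the second scene aligns with both query modalities.
motion = np.array([1.0, 0.0, 0.0, 0.0])
text = np.array([1.0, 0.0, 0.0, 0.0])
gallery = [np.array([0.0, 1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0, 0.0])]
best, scores = retrieve_scene(motion, text, gallery)
```

In practice each embedding would come from a trained per-modality encoder, and the same symmetric score supports retrieval in any direction (scene-to-motion, text-to-scene, etc.), which is what enables the zero-shot applications mentioned below.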