🤖 AI Summary
This work addresses the dual challenges of scarce multi-source heterogeneous data and the demand for low-latency, high-fidelity motion generation in real-time human interaction response synthesis. The authors propose ReMoGen, a modular framework that leverages a large-scale single-person motion prior and adapts it to diverse interaction scenarios through plug-and-play Meta-Interaction modules. By integrating segmented sequence generation with a lightweight frame-level refinement mechanism, ReMoGen enhances motion coherence and responsiveness while maintaining real-time performance. Notably, the shared motion prior requires no task-specific retraining; only lightweight Meta-Interaction modules are trained independently per interaction domain. The method demonstrates strong cross-domain generalization, producing high-quality, temporally responsive motion sequences across human–human, human–scene, and multimodal interaction tasks, and outperforming existing approaches.
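To make the modular design concrete, below is a minimal PyTorch sketch (not the authors' code) of the plug-and-play idea: a motion prior trained once on single-person data is frozen, and a small, independently trained Meta-Interaction adapter maps interaction cues into a conditioning signal for it. All class names, dimensions, and the additive-conditioning scheme are illustrative assumptions; the paper does not specify these details.

```python
# Hedged sketch of "frozen prior + plug-in adapter". Every architectural
# choice here (GRU prior, additive conditioning, dimensions) is assumed.
import torch
import torch.nn as nn

class MotionPrior(nn.Module):
    """Frozen single-person motion model: past poses -> next-segment poses."""
    def __init__(self, pose_dim=63, hidden=256, segment_len=8):
        super().__init__()
        self.pose_dim = pose_dim
        self.segment_len = segment_len
        self.encoder = nn.GRU(pose_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, segment_len * pose_dim)

    def forward(self, past_poses, cond=None):
        # past_poses: (B, T, pose_dim); cond: optional (B, hidden) bias
        _, h = self.encoder(past_poses)          # h: (1, B, hidden)
        h = h[-1] + (cond if cond is not None else 0.0)
        out = self.decoder(h)                    # (B, segment_len * pose_dim)
        return out.view(-1, self.segment_len, self.pose_dim)

class MetaInteraction(nn.Module):
    """Plug-in adapter: fused interaction cues (partner motion, scene
    geometry, optional text features, ...) -> conditioning vector."""
    def __init__(self, cue_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(cue_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, cues):                     # cues: (B, cue_dim)
        return self.net(cues)

# Usage: only the adapter carries gradients; the prior stays frozen,
# which is how a single prior can serve many interaction domains.
prior = MotionPrior().eval()
for p in prior.parameters():
    p.requires_grad_(False)
adapter = MetaInteraction()
past = torch.randn(2, 16, 63)                   # 16 observed frames
cues = torch.randn(2, 128)                      # fused interaction features
next_segment = prior(past, cond=adapter(cues))  # (2, 8, 63)
```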
📝 Abstract
Human behaviors in real-world environments are inherently interactive, with an individual's motion shaped by surrounding agents and the scene. Modeling such interactive behavior is essential for applications in virtual avatars, interactive animation, and human-robot collaboration. We target real-time human interaction-to-reaction generation: synthesizing an ego agent's future motion from dynamic multi-source cues, including others' actions, scene geometry, and optional high-level semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data distributed across heterogeneous single-person, human-human, and human-scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. To support responsive online interaction, ReMoGen performs segment-level generation together with a lightweight Frame-wise Segment Refinement module that incorporates newly observed cues at the frame level, improving both responsiveness and temporal coherence without expensive full-sequence inference. Extensive experiments across human-human, human-scene, and mixed-modality interaction settings show that ReMoGen produces high-quality, coherent, and responsive reactions, while generalizing effectively across diverse interaction scenarios.
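The online inference pattern described above (plan a short segment, then correct each upcoming frame as new cues arrive) can be pictured with the following hedged sketch, reusing the `prior` and `adapter` from the snippet above. The residual-MLP refiner and the exact loop structure are assumptions for illustration, not the paper's Frame-wise Segment Refinement design.

```python
# Hedged sketch of segment-level generation + frame-wise refinement.
# The refiner's form (residual MLP over [frame, cue]) is an assumption.
import torch
import torch.nn as nn

class FrameRefiner(nn.Module):
    """Cheap per-frame correction driven by the latest observed cue."""
    def __init__(self, pose_dim=63, cue_dim=128, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pose_dim + cue_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, pose_dim))

    def forward(self, frame, cue):
        # Residual update: keep the segment plan, add a small correction,
        # avoiding a full-sequence re-generation at every frame.
        return frame + self.net(torch.cat([frame, cue], dim=-1))

def online_reaction_loop(prior, adapter, refiner, observe_cue, past, steps=32):
    """Stream reactions: plan a segment, then refine and emit frame by frame."""
    emitted = []
    while len(emitted) < steps:
        cue = observe_cue()                        # cues at planning time
        segment = prior(past, cond=adapter(cue))   # (B, L, pose_dim) plan
        for t in range(segment.shape[1]):
            cue = observe_cue()                    # fresh cue every frame
            frame = refiner(segment[:, t], cue)    # lightweight refinement
            emitted.append(frame)
            # Slide the observation window over the newly emitted frame.
            past = torch.cat([past, frame.unsqueeze(1)], dim=1)[:, -16:]
            if len(emitted) >= steps:
                break
    return torch.stack(emitted, dim=1)             # (B, steps, pose_dim)

# Usage with the (hypothetical) modules sketched earlier:
refiner = FrameRefiner()
stream = online_reaction_loop(prior, adapter, refiner,
                              observe_cue=lambda: torch.randn(2, 128),
                              past=torch.randn(2, 16, 63))
```

Under these assumptions, the expensive generator runs only once per segment, while the per-frame cost is a single small MLP, which is what makes low-latency, cue-responsive streaming plausible.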