🤖 AI Summary
Existing action generation methods neglect the coupling between perception and action, relying on task-specific perception modules that bear little resemblance to human perception, which leads to unnatural virtual agent behavior. This paper introduces CLOPS, the first framework enabling human-like navigation and motor control from egocentric visual input alone. Its core innovation lies in decoupling low-level motor skills from high-level visual decision-making: a motion prior model is learned from large-scale motion-capture data, while a Q-learning-driven policy network maps visual observations to high-level control commands. Experiments demonstrate that CLOPS autonomously avoids obstacles and generates scene-adaptive, naturalistic locomotion, significantly improving behavioral human-likeness. By breaking the traditional perception–action separation paradigm, this work establishes a novel pathway for human-like embodied agent modeling.
📝 Abstract
The way we perceive the world fundamentally shapes how we move, whether it is how we navigate in a room or how we interact with other humans. Current human motion generation methods neglect this interdependency and use task-specific "perception" that differs radically from that of humans. We argue that the generation of human-like avatar behavior requires human-like perception. Consequently, in this work we present CLOPS, the first human avatar that solely uses egocentric vision to perceive its surroundings and navigate. Using vision as the primary driver of motion, however, gives rise to a significant challenge for training avatars: existing datasets have either isolated human motion, without the context of a scene, or lack scale. We overcome this challenge by decoupling the learning of low-level motion skills from the learning of high-level control that maps visual input to motion. First, we train a motion prior model on a large motion capture dataset. Then, a policy is trained using Q-learning to map egocentric visual inputs to high-level control commands for the motion prior. Our experiments empirically demonstrate that egocentric vision can give rise to human-like motion characteristics in our avatars. For example, the avatars walk such that they avoid obstacles present in their visual field. These findings suggest that equipping avatars with human-like sensors, particularly egocentric vision, holds promise for training avatars that behave like humans.
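The two-stage recipe above (a frozen motion prior driven by a Q-learned controller) can be sketched in miniature. The following is an illustrative toy only, not the paper's implementation: CLOPS consumes egocentric images and presumably uses deep function approximation, whereas here the observation is a hypothetical two-element "obstacle on left / obstacle on right" indicator, the command set (`walk_forward`, `turn_left`, `turn_right`), the reward, and the linear Q-function are all invented for illustration.

```python
import numpy as np

# Hypothetical high-level commands consumed by a motion prior (invented names).
COMMANDS = ["walk_forward", "turn_left", "turn_right"]


class LinearQPolicy:
    """Minimal linear Q-learning: Q(obs, a) = W[a] . obs."""

    def __init__(self, obs_dim, n_actions, lr=0.1, gamma=0.95, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_actions, obs_dim))
        self.lr, self.gamma = lr, gamma

    def q_values(self, obs):
        return self.W @ obs

    def act(self, obs, epsilon=0.0, rng=None):
        # Epsilon-greedy action selection over discrete commands.
        if rng is not None and rng.random() < epsilon:
            return int(rng.integers(len(COMMANDS)))
        return int(np.argmax(self.q_values(obs)))

    def update(self, obs, action, reward, next_obs, done):
        # One-step temporal-difference update toward the bootstrapped target.
        target = reward if done else reward + self.gamma * np.max(self.q_values(next_obs))
        td_error = target - self.q_values(obs)[action]
        self.W[action] += self.lr * td_error * obs
        return td_error


# Toy training loop: obstacle on the left -> turn_right is rewarded, and
# vice versa. Each step is a one-step episode (done=True).
OBSTACLE_LEFT = np.array([1.0, 0.0])
OBSTACLE_RIGHT = np.array([0.0, 1.0])
CORRECT = {0: 2, 1: 1}  # obs index -> rewarded action index

policy = LinearQPolicy(obs_dim=2, n_actions=len(COMMANDS))
rng = np.random.default_rng(1)
for step in range(400):
    obs_idx = step % 2
    obs = OBSTACLE_LEFT if obs_idx == 0 else OBSTACLE_RIGHT
    action = policy.act(obs, epsilon=0.3, rng=rng)
    reward = 1.0 if action == CORRECT[obs_idx] else -1.0
    policy.update(obs, action, reward, obs, done=True)

print(COMMANDS[policy.act(OBSTACLE_LEFT)])   # steer away from the left obstacle
print(COMMANDS[policy.act(OBSTACLE_RIGHT)])  # steer away from the right obstacle
```

In the full system, the greedy command at each step would be handed to the pretrained motion prior, which turns it into plausible full-body motion; the decoupling means the controller never has to learn low-level kinematics.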