LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing humanoid robot policies often rely on reference motions or task-specific rewards, limiting their ability to generalize across object geometries and compose diverse skills over extended time horizons. This work proposes the first unified interaction representation based on distance fields, leveraging geometric cues—such as surface distance, gradient, and velocity decomposition—to drive a single whole-body humanoid policy without requiring reference motions. By integrating a variational autoencoder (VAE), an adversarial interaction prior, and DAgger-style distillation, and by aligning egocentric depth features across domains, the framework enables seamless sim-to-real transfer using only visual inputs. Evaluated on objects scaled from 0.4× to 1.6×, the system achieves 80–100% success rates on PickUp and SitStand tasks, 62.1% on multi-task composition trajectories, and supports continuous execution of skill sequences up to 40 steps long.

📝 Abstract
Humanoid robots that autonomously interact with physical environments over extended horizons represent a central goal of embodied intelligence. Existing approaches rely on reference motions or task-specific rewards, tightly coupling policies to particular object geometries and precluding multi-skill generalization within a single framework. A unified interaction representation enabling reference-free inference, geometric generalization, and long-horizon skill composition within one policy remains an open challenge. Here we show that a Distance Field (DF) provides such a representation: LessMimic conditions a single whole-body policy on DF-derived geometric cues (surface distances, gradients, and velocity decompositions), removing the need for motion references; interaction latents are encoded via a Variational Auto-Encoder (VAE) and post-trained with Adversarial Interaction Priors (AIP) under Reinforcement Learning (RL). Through DAgger-style distillation that aligns DF latents with egocentric depth features, LessMimic further transfers seamlessly to vision-only deployment without motion-capture (MoCap) infrastructure. A single LessMimic policy achieves 80–100% success across object scales from 0.4× to 1.6× on PickUp and SitStand, where baselines degrade sharply; attains 62.1% success on trajectories composed of five task instances; and remains viable for up to 40 sequentially composed tasks. By grounding interaction in local geometry rather than demonstrations, LessMimic offers a scalable path toward humanoid robots that generalize, compose skills, and recover from failures in unstructured environments.
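The geometric cues named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the sphere SDF stand-in, the function names, and the finite-difference gradient are all assumptions made for illustration; they only show what "surface distance, gradient, and velocity decomposition" of a body point relative to an object surface could look like.

```python
import numpy as np

# Hypothetical stand-in for an object's signed distance field (SDF);
# any object geometry with a queryable SDF would play the same role.
def sphere_sdf(p, center=np.zeros(3), radius=0.5):
    """Signed distance from point p to a sphere surface."""
    return np.linalg.norm(p - center) - radius

def df_cues(p, v, sdf, eps=1e-4):
    """Illustrative DF-derived cues for a body point p moving with velocity v:
    surface distance, DF gradient, and the decomposition of v into a
    normal (approach/retreat) and a tangential (sliding) component."""
    d = sdf(p)
    # Gradient via central finite differences; it points away from the surface.
    grad = np.array([(sdf(p + eps * e) - sdf(p - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
    n = grad / (np.linalg.norm(grad) + 1e-9)  # unit direction toward/away from surface
    v_n = np.dot(v, n) * n                    # component along the surface normal
    v_t = v - v_n                             # component sliding along the surface
    return d, grad, v_n, v_t

# A "hand" 1 m from the sphere center, moving sideways relative to the surface.
d, g, vn, vt = df_cues(np.array([1.0, 0.0, 0.0]),
                       np.array([0.0, 1.0, 0.5]), sphere_sdf)
print(round(float(d), 3))  # 0.5: the point is 0.5 m from the sphere surface
```

Conditioning a policy on such cues, rather than on a reference motion, is what lets the same observation space describe interaction with objects of different scales.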
Problem

Research questions and friction points this paper is trying to address.

Humanoid Interaction
Long-Horizon
Geometric Generalization
Unified Representation
Skill Composition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distance Field
Reference-Free Policy
Geometric Generalization
Adversarial Interaction Priors
Vision-Based Deployment
Yutang Lin
Institute for AI, Peking University; Beijing Institute for General Artificial Intelligence (BIGAI); School of Psychological and Cognitive Sciences, Peking University; State Key Lab of General AI; Beijing Key Laboratory of Behavior and Mental Health, Peking University; Embodied Intelligence Lab, PKU-Wuhan Institute for Artificial Intelligence
Jieming Cui
Peking University
Yixuan Li
School of Computer Science and Technology, Beijing Institute of Technology; Beijing Institute for General Artificial Intelligence (BIGAI); State Key Lab of General AI
Baoxiong Jia
Ph.D. in Computer Science, UCLA
Computer Vision, Artificial Intelligence
Yixin Zhu
Assistant Professor, Peking University
Computer Vision, Visual Reasoning, Human-Robot Teaming
Siyuan Huang
Beijing Institute for General Artificial Intelligence (BIGAI)
Embodied AI, 3D Vision, Robotics, 3D Scene Understanding