🤖 AI Summary
Existing humanoid robot policies often rely on reference motions or task-specific rewards, which limits their ability to generalize across object geometries and to compose diverse skills over extended time horizons. This work proposes the first unified interaction representation based on distance fields, leveraging geometric cues such as surface distance, gradient, and velocity decomposition to drive a single whole-body humanoid policy without requiring reference motions. By integrating a variational autoencoder (VAE), an adversarial interaction prior, and DAgger-style distillation, and by aligning egocentric depth features across domains, the framework enables seamless sim-to-real transfer using only visual inputs. Evaluated on objects scaled from 0.4× to 1.6×, the system achieves 80–100% success rates on PickUp and SitStand tasks, 62.1% success on trajectories composed of five task instances, and supports continuous execution of sequences of up to 40 composed tasks.
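
To make the distance-field cues concrete, here is a minimal sketch that computes a surface distance, its gradient, and a normal/tangential velocity decomposition for one body point, using a point-cloud nearest-neighbor query as a stand-in for a true distance field. The function name, interface, and query method are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def df_interaction_features(points, body_pos, body_vel, eps=1e-6):
    """Distance, gradient, and velocity split for one body point.

    `points` is an (N, 3) object surface point cloud used as a crude
    stand-in for a real (mesh- or grid-based) distance field query.
    Hypothetical interface; not the paper's implementation.
    """
    dists = np.linalg.norm(points - body_pos, axis=1)    # (N,)
    i = int(np.argmin(dists))
    distance = float(dists[i])                           # surface distance

    # Gradient of the distance field: unit vector from the nearest
    # surface point toward the body point.
    grad = (body_pos - points[i]) / max(distance, eps)

    # Velocity decomposition: component along the gradient (approach /
    # retreat) plus the tangential residual (sliding along the surface).
    v_normal = np.dot(body_vel, grad) * grad
    v_tangent = body_vel - v_normal
    return distance, grad, v_normal, v_tangent

# Example: a hand point approaching a box sampled as a point cloud.
rng = np.random.default_rng(0)
box = rng.uniform(-0.5, 0.5, size=(2048, 3))
d, g, vn, vt = df_interaction_features(
    box,
    body_pos=np.array([1.0, 0.0, 0.0]),
    body_vel=np.array([-0.3, 0.1, 0.0]))
```

Because these cues depend only on local geometry relative to the body, the same features apply unchanged to rescaled or reshaped objects, which is the intuition behind the reported geometric generalization.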
📝 Abstract
Humanoid robots that autonomously interact with physical environments over extended horizons represent a central goal of embodied intelligence. Existing approaches rely on reference motions or task-specific rewards, tightly coupling policies to particular object geometries and precluding multi-skill generalization within a single framework. A unified interaction representation enabling reference-free inference, geometric generalization, and long-horizon skill composition within one policy remains an open challenge. Here we show that the Distance Field (DF) provides such a representation: LessMimic conditions a single whole-body policy on DF-derived geometric cues (surface distances, gradients, and velocity decompositions), removing the need for motion references. Interaction latents are encoded via a Variational Auto-Encoder (VAE) and post-trained using Adversarial Interaction Priors (AIP) under Reinforcement Learning (RL). Through DAgger-style distillation that aligns DF latents with egocentric depth features, LessMimic further transfers seamlessly to vision-only deployment without motion capture (MoCap) infrastructure. A single LessMimic policy achieves 80--100% success across object scales from 0.4x to 1.6x on PickUp and SitStand, where baselines degrade sharply; attains 62.1% success on trajectories composed of five task instances; and remains viable for sequences of up to 40 composed tasks. By grounding interaction in local geometry rather than demonstrations, LessMimic offers a scalable path toward humanoid robots that generalize, compose skills, and recover from failures in unstructured environments.
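
As a rough sketch of the distillation stage described above, the following PyTorch code implements a generic DAgger-style update: a frozen teacher that sees privileged DF latents relabels states collected under the current student policy, while a depth encoder is trained to align its egocentric depth features with those latents and a student head imitates the teacher's actions. All module architectures, dimensions, and the unweighted loss sum are assumptions for illustration; the paper's actual networks and objectives may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT, PROPRIO, ACT = 64, 48, 29            # hypothetical dimensions

class DepthEncoder(nn.Module):
    """Student backbone: egocentric depth image -> latent (hypothetical)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, LATENT))

    def forward(self, depth):                # (B, 1, H, W) -> (B, LATENT)
        return self.net(depth)

def mlp(inp, out):                           # small policy head
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(), nn.Linear(256, out))

teacher = mlp(PROPRIO + LATENT, ACT)         # frozen expert, sees DF latents
student = mlp(PROPRIO + LATENT, ACT)         # deployed head, sees depth latents
encoder = DepthEncoder()
opt = torch.optim.Adam(
    [*student.parameters(), *encoder.parameters()], lr=3e-4)

def distill_step(batch):
    """One DAgger-style update on states visited by the current student."""
    proprio, depth, df_latent = batch        # logged during student rollouts
    with torch.no_grad():                    # teacher relabels visited states
        target = teacher(torch.cat([proprio, df_latent], dim=-1))
    z = encoder(depth)                       # depth features -> latent
    action = student(torch.cat([proprio, z], dim=-1))
    # Align depth features with privileged DF latents; imitate the teacher.
    loss = F.mse_loss(z, df_latent) + F.mse_loss(action, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)

# Synthetic batch just to exercise the update (shapes are illustrative).
batch = (torch.randn(8, PROPRIO), torch.randn(8, 1, 64, 64),
         torch.randn(8, LATENT))
distill_step(batch)
```

Collecting the relabeled data under the student's own state distribution, rather than the teacher's, is what distinguishes DAgger-style distillation from plain behavior cloning and is what lets the vision-only student remain accurate on the states it actually visits at deployment.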