🤖 AI Summary
This work addresses the limited generalization of existing visuomotor policies under raw RGB inputs, which are often disrupted by irrelevant visual variations such as background clutter or object appearance changes. To enhance robustness to out-of-distribution visual perturbations without altering the underlying policy, the authors propose a task-aware observation interface that leverages a two-level (L0/L1) unified semantic–geometric representation. This interface integrates open-vocabulary semantic segmentation (SAM3) with monocular depth estimation (Depth Anything 3) to construct standardized inputs through semantic recoloring and depth-guided overwriting, effectively decoupling task-relevant from task-irrelevant visual information while preserving a standard image format. Evaluated on RoboMimic, ManiSkill, RLBench, and real-world Franka robot tasks, the method maintains in-distribution performance while significantly improving cross-domain visual robustness.
📝 Abstract
Visuomotor policies learned from demonstrations often overfit to nuisance visual factors in raw RGB observations, resulting in brittle behavior under appearance shifts such as background changes and object recoloring. We propose a task-aware observation interface that canonicalizes visual input into a shared representation, improving robustness to out-of-distribution (OOD) appearance changes without modifying or fine-tuning the policy. Given an RGB image and an open-vocabulary specification of task-relevant entities, we use SAM3 to segment the target object and robot/gripper. We construct an L0 observation by repainting segmented entities with predefined semantic colors on a constant background. For tasks requiring stronger geometric cues, we further inject monocular depth from Depth Anything 3 into the segmented regions via depth-guided overwrite, yielding a unified semantic–geometric observation (L1) that remains a standard 3-channel, image-like input. We evaluate on RoboMimic (Lift), ManiSkill YCB grasping under clutter, four RLBench tasks under controlled appearance shifts, and two real-world Franka tasks (ReachX and CloseCabinet). Across benchmarks and policy backbones (Flow Matching Policy and SmolVLA), our interface preserves in-distribution performance while substantially improving robustness under OOD visual shifts.
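The L0/L1 construction described in the abstract can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the palette colors, the depth normalization, and the synthetic masks/depth (standing in for SAM3 and Depth Anything 3 outputs) are all assumptions made for the example.

```python
import numpy as np

# Hypothetical semantic palette (assumption; the paper's actual colors
# are not specified in the abstract).
PALETTE = {"object": (255, 0, 0), "gripper": (0, 255, 0)}
BACKGROUND = (128, 128, 128)

def build_l0(image, masks):
    """L0: repaint each segmented entity with its semantic color
    on a constant background; output stays a standard HxWx3 image."""
    obs = np.empty_like(image)
    obs[...] = BACKGROUND
    for name, mask in masks.items():
        obs[mask] = PALETTE[name]
    return obs

def build_l1(image, masks, depth):
    """L1: start from L0, then overwrite the segmented regions with
    normalized monocular depth replicated across the 3 channels."""
    obs = build_l0(image, masks)
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    d8 = (d * 255).astype(np.uint8)
    for mask in masks.values():
        obs[mask] = d8[mask, None]  # broadcast depth into all 3 channels
    return obs

# Toy inputs: synthetic masks and a depth ramp stand in for the
# segmentation and depth models.
H, W = 64, 64
rgb = np.zeros((H, W, 3), dtype=np.uint8)
masks = {
    "object": np.zeros((H, W), dtype=bool),
    "gripper": np.zeros((H, W), dtype=bool),
}
masks["object"][10:20, 10:20] = True
masks["gripper"][40:50, 40:50] = True
depth = np.linspace(0.5, 2.0, H * W).reshape(H, W)

l0 = build_l0(rgb, masks)
l1 = build_l1(rgb, masks, depth)
```

Because both L0 and L1 remain ordinary 3-channel images, they can be fed to an unmodified image-conditioned policy; the interface swap happens entirely in the observation pipeline.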