Action-Prior Denoising for Smooth Real-Time Chunking

📅 2026-05-25

🤖 AI Summary

Existing training-time real-time chunking (RTC) methods handle action overlap with a binary prefix mask, which treats every non-prefix action as fully unconstrained. Under asynchronous execution this under-models the overlap region, where later actions remain editable but should still stay close to the previously planned trajectory, and the result can be jerky motion. This work proposes Soft RTC, which generalizes the hard mask to a soft overlap window through action-prior denoising: during training, overlap tokens are corrupted from partially denoised states rather than pure noise, and at inference a lightweight token-wise blending rule injects the aligned previous chunk as the prior. On the 12 released large Kinetix levels, a short soft window nearly matches hard training-time RTC in overall solve rate (0.809 vs. 0.815), while a medium window reduces high-delay action delta and jerk by 9.1% and 9.6%, respectively. Both variants keep near-naive runtime. A small preliminary real-robot sorting study adds evidence that training-time RTC can improve task completion and that Soft RTC yields the lowest commanded-action finite-difference metrics among the tested policies.
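The smoothness figures quoted above are finite-difference statistics over the executed action sequence. The summary does not give the exact definitions, so the following is only a minimal sketch under stated assumptions: "action delta" is taken as the mean norm of first differences and "jerk" as the mean norm of third differences. The function names `action_delta` and `action_jerk` are hypothetical and not taken from the paper.

```python
import numpy as np

def action_delta(actions):
    """Mean L2 norm of first differences a[t+1] - a[t] (assumed definition)."""
    a = np.asarray(actions, dtype=float)        # shape (T, action_dim)
    diffs = np.diff(a, n=1, axis=0)             # shape (T-1, action_dim)
    return float(np.linalg.norm(diffs, axis=-1).mean())

def action_jerk(actions):
    """Mean L2 norm of third differences (assumed definition, unit time step)."""
    a = np.asarray(actions, dtype=float)
    diffs = np.diff(a, n=3, axis=0)             # shape (T-3, action_dim)
    return float(np.linalg.norm(diffs, axis=-1).mean())

# A linear ramp changes by a constant amount per step, so it has zero jerk.
ramp = np.stack([np.arange(10.0), np.arange(10.0)], axis=1)
ramp_delta = action_delta(ramp)
ramp_jerk = action_jerk(ramp)
```

Under these assumed definitions, a relative reduction such as the reported 9.1% would correspond to `1 - metric_soft / metric_hard` evaluated on matched rollouts.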

📝 Abstract

Real-time chunking (RTC) lets chunked action policies operate under inference delay by conditioning a newly generated action chunk on actions already committed by the previous chunk. Training-time RTC simulates this delay during learning and avoids expensive guidance at deployment, but its binary prefix mask treats all non-prefix tokens as fully unconstrained. This under-models asynchronous execution: early overlap actions are fixed, while later overlap actions remain editable but should still stay close to the previous plan. We propose Soft RTC, a training-time RTC generalization based on action-prior denoising. Soft RTC constructs corrupted overlap tokens from partially denoised states instead of pure noise and injects the aligned previous chunk as the same prior during inference through a lightweight token-wise blending rule. On the 12 released large Kinetix levels, a short soft window nearly matches hard training-time RTC in overall solve rate (0.809 vs. 0.815), while a medium window reduces high-delay action delta and jerk by 9.1% and 9.6% relative to hard RTC. Both variants keep near-naive runtime, unlike inference-time RTC baselines. A small preliminary real-robot sorting study provides additional evidence that training-time RTC can improve completion and that Soft RTC gives the lowest commanded-action finite-difference metrics among the tested policies.
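The token-wise blending idea can be illustrated with a small NumPy sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the linear decay of the prior weight across the soft window, a linear interpolation between noise and the previous plan as the "partially denoised" state, and the hypothetical names `soft_prior_weights` and `blend_with_prior`. A hard binary prefix mask is the special case `soft_len=0`.

```python
import numpy as np

def soft_prior_weights(horizon, delay, soft_len):
    """Per-token prior strength in [0, 1].

    The first `delay` tokens are the committed prefix (weight 1), the next
    `soft_len` tokens decay linearly toward 0 (assumed schedule), and the
    remaining tokens are unconstrained (weight 0).
    """
    assert 0 <= delay <= horizon and soft_len >= 0
    w = np.zeros(horizon)
    w[:delay] = 1.0
    ramp = np.linspace(1.0, 0.0, soft_len + 2)[1:-1]  # strictly between 1 and 0
    end = min(horizon, delay + soft_len)
    w[delay:end] = ramp[: end - delay]
    return w

def blend_with_prior(noise, prev_chunk_aligned, weights):
    """Token-wise blend of the aligned previous chunk with noise.

    A weight of 1 reproduces the previous plan exactly, 0 gives pure noise,
    and values in between give a partially denoised state.
    """
    w = weights[:, None]                              # broadcast over action_dim
    return w * prev_chunk_aligned + (1.0 - w) * noise

rng = np.random.default_rng(0)
horizon, action_dim = 8, 2
prev_plan = rng.normal(size=(horizon, action_dim))    # stand-in for the aligned previous chunk
noise = rng.normal(size=(horizon, action_dim))
weights = soft_prior_weights(horizon, delay=2, soft_len=3)
x_init = blend_with_prior(noise, prev_plan, weights)
```

In this sketch the blended array would serve as the starting state for the denoiser, so the extra inference cost is one elementwise operation per chunk, which is in line with the near-naive runtime described above.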

Problem

Research questions and friction points this paper is trying to address.

real-time chunking

action smoothness

asynchronous execution

inference delay

action continuity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft RTC

action-prior denoising

real-time chunking