AI Summary
To address the insufficient accuracy of low-level action estimation in language-guided robotic manipulation, this paper proposes a self-supervised method based on action optical-flow modeling and dynamic historical fusion. The approach models robot actions as video optical flow, yielding a learnable "action flow" representation. It introduces a working memory pool and a multi-layer fusion module that enable iterative cross-temporal retrieval, denoising, and fusion of action representations, supporting fine-grained, temporally consistent action estimation. Evaluated on the LIBERO benchmark, the method achieves a 7.9% absolute improvement in task success rate over prior state-of-the-art methods; on the long-horizon challenge set LIBERO-Long, it attains a 7.8% accuracy gain, demonstrating improved robustness and generalization under complex, multi-step language instructions. The contributions lie in (1) formulating action as differentiable optical flow, (2) introducing a memory-augmented temporal fusion architecture, and (3) enabling high-fidelity, temporally coherent action prediction without explicit supervision.
Abstract
Language-instructed robot manipulation has garnered significant interest due to the potential of learning from collected data. While challenges in high-level perception and planning are being steadily addressed by progress in large pre-trained models, the low precision of low-level action estimation has emerged as the key factor limiting manipulation performance. To this end, this paper introduces a novel robot manipulation framework, ActionSink, to pave the way toward precise action estimation in learning-based robot manipulation. As the name suggests, ActionSink reformulates robot actions as action-caused optical flows from videos, called "action flow", in a self-supervised manner; these flows are then retrieved and integrated to enhance action estimation. Specifically, ActionSink incorporates two primary modules. The first is a coarse-to-fine action flow matcher, which continuously refines the accuracy of action flow via an iterative retrieval-and-denoising process. The second is a dynamic action flow integrator, which employs a working memory pool that dynamically and efficiently manages the historical action flows used to enhance the current action estimation. Within this module, a multi-layer fusion module integrates the direct estimate with action flows from both the current observation and the working memory, achieving highly accurate action estimation through a series of estimation-integration steps. Our ActionSink framework outperformed the prior state of the art on the LIBERO benchmark by 7.9% in success rate, and obtained a nearly 8% accuracy gain on the challenging long-horizon task suite LIBERO-Long.
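To make the integrator's role concrete, the following is a minimal sketch of a fixed-capacity working memory pool that stores recent "action flow" vectors, retrieves the entries most similar to the current estimate, and fuses them with it. All names (`WorkingMemoryPool`, `fuse`), the cosine-similarity retrieval rule, and the convex-combination fusion are illustrative assumptions; the paper's actual matcher and multi-layer fusion module are learned networks, not these hand-written rules.

```python
# Hypothetical sketch of a dynamic action flow integrator.
# Flows are plain lists of floats standing in for learned representations.
import math
from collections import deque

def cosine(a, b):
    # Cosine similarity between two flow vectors (0.0 if either is zero).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class WorkingMemoryPool:
    def __init__(self, capacity=8):
        # Bounded deque: the oldest historical flows are evicted first.
        self.pool = deque(maxlen=capacity)

    def add(self, flow):
        self.pool.append(flow)

    def retrieve(self, query, k=3):
        # Return the k stored flows most similar to the current estimate.
        ranked = sorted(self.pool, key=lambda f: cosine(query, f), reverse=True)
        return ranked[:k]

def fuse(current, retrieved, alpha=0.7):
    # Convex combination of the current estimate with the mean of the
    # retrieved historical flows (a stand-in for learned multi-layer fusion).
    if not retrieved:
        return current
    mean = [sum(vals) / len(retrieved) for vals in zip(*retrieved)]
    return [alpha * c + (1 - alpha) * m for c, m in zip(current, mean)]
```

In this toy version, each control step would add the newly estimated flow to the pool, retrieve similar historical flows, and emit the fused result, mimicking the estimation-integration loop the abstract describes at the representation level.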