StreamingClaw Technical Report

📅 2026-03-23
🤖 AI Summary
This work addresses the fragmentation of perception, memory, and action capabilities that existing embodied-intelligence systems exhibit during streaming video understanding and real-time interaction. To bridge this gap, we propose StreamingClaw, a unified end-to-end closed-loop framework that integrates streaming multimodal memory, online object-evolution-driven real-time reasoning, future event prediction, and an action-centric skill library. StreamingClaw enables low-latency perception-decision-action cycles, supports multi-agent collaboration through shared memory, and is compatible with the OpenClaw open-source ecosystem. The framework enables direct control over physical environments and provides an efficient, deployable solution for scalable embodied intelligence.

📝 Abstract
Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities: they support only offline video understanding, lack long-term multimodal memory mechanisms, or struggle to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck that prevents them from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence that is compatible with OpenClaw and supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) real-time streaming reasoning; (2) reasoning about future events and proactive interaction as interaction objectives evolve online; (3) multimodal long-term storage, hierarchical evolution, and efficient retrieval of memory shared across multiple agents; (4) a perception-decision-action closed loop that, beyond conventional tools and skills, also provides streaming tools and action-centric skills tailored to real-world physical environments; and (5) compatibility with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw unifies online real-time reasoning, multimodal long-term memory, and proactive interaction within a single framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.
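The abstract describes a perception-decision-action closed loop backed by a bounded streaming memory shared across agents. A minimal sketch of such a loop is below; all class and method names are illustrative assumptions, not the actual StreamingClaw API.

```python
from collections import deque

class StreamingMemory:
    """Bounded buffer holding the most recent observations (illustrative stand-in
    for the paper's streaming multimodal memory)."""
    def __init__(self, capacity=100):
        self.buffer = deque(maxlen=capacity)

    def store(self, observation):
        self.buffer.append(observation)

    def retrieve(self, n=5):
        # Return up to n most recent observations as decision context.
        return list(self.buffer)[-n:]

class ClosedLoopAgent:
    """Runs the perceive -> decide -> act cycle over a stream of frames."""
    def __init__(self, memory):
        self.memory = memory

    def perceive(self, frame):
        self.memory.store(frame)
        return frame

    def decide(self, observation):
        # Placeholder policy: condition the decision on recent memory.
        context = self.memory.retrieve()
        return {"action": "track", "target": observation, "context_len": len(context)}

    def act(self, decision):
        # In a real system this would dispatch to an action-centric skill library.
        return f"executing {decision['action']} on {decision['target']}"

agent = ClosedLoopAgent(StreamingMemory(capacity=10))
for frame in ["frame-0", "frame-1", "frame-2"]:
    observation = agent.perceive(frame)
    decision = agent.decide(observation)
    result = agent.act(decision)

print(result)  # prints: executing track on frame-2
```

Because the memory is shared by reference, several agents could hold the same `StreamingMemory` instance, which is one plausible reading of the multi-agent shared-memory design.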
Problem

Research questions and friction points this paper is trying to address.

streaming video understanding · embodied intelligence · real-time reasoning · multimodal memory · proactive interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming video understanding · embodied intelligence · multimodal memory · real-time reasoning · proactive interaction
Authors

Jiawei Chen
Zhe Chen
Chaoqun Du (Department of Automation, Tsinghua University)
Maokui He
Wei He
Hengtao Li
Qizhen Li
Zide Liu (Zhejiang University; Diffusion Models, Video Editing)
Hao Ma
Xuhao Pan
Chang Ren
Xudong Rao
Xintian Shen
Chenfeng Wang
Tao Wei
Chengjun Yu
Pengfei Yu (University of Illinois at Urbana-Champaign; Natural Language Processing, Machine Learning)
Shengyu Yao
Chunpeng Zhou
Kun Zhan
Lihao Zheng
Pan Zhou
Xuhan Zhu (UCAS; Computer Vision, Vision Language Model)
Yufei Zheng