StreamingClaw Technical Report

📅 2026-03-23
🤖 AI Summary
This work addresses the fragmentation of perception, memory, and action capabilities that existing embodied-intelligence systems exhibit during streaming video understanding and real-time interaction. To bridge this gap, we propose StreamingClaw, a unified end-to-end closed-loop framework that integrates streaming multimodal memory, online object-evolution-driven real-time reasoning, future event prediction, and an action-centric skill library. StreamingClaw enables low-latency perception-decision-action cycles, supports multi-agent collaboration through shared memory, and is compatible with the OpenClaw open-source ecosystem. The framework enables direct control over physical environments and provides an efficient, deployable solution for scalable embodied intelligence.

📝 Abstract
Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities: they support only offline video understanding, lack long-term multimodal memory mechanisms, or struggle to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck that prevents them from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence that is compatible with OpenClaw and supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) real-time streaming reasoning; (2) reasoning about future events and proactive interaction as interaction objectives evolve online; (3) multimodal long-term storage, hierarchical evolution, and efficient retrieval of memory shared across multiple agents; (4) a perception-decision-action closed loop that, beyond conventional tools and skills, also provides streaming tools and action-centric skills tailored to real-world physical environments; and (5) compatibility with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw unifies online real-time reasoning, multimodal long-term memory, and proactive interaction within a single framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.
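The abstract describes a perception-decision-action closed loop backed by a bounded streaming memory shared across agents. A minimal sketch of such a loop is below; all class and method names are illustrative assumptions, not the actual StreamingClaw API.

```python
from collections import deque

class StreamingMemory:
    """Bounded buffer holding the most recent observations (illustrative stand-in
    for the paper's streaming multimodal memory)."""
    def __init__(self, capacity=100):
        self.buffer = deque(maxlen=capacity)

    def store(self, observation):
        self.buffer.append(observation)

    def retrieve(self, n=5):
        # Return up to n most recent observations as decision context.
        return list(self.buffer)[-n:]

class ClosedLoopAgent:
    """Runs the perceive -> decide -> act cycle over a stream of frames."""
    def __init__(self, memory):
        self.memory = memory

    def perceive(self, frame):
        self.memory.store(frame)
        return frame

    def decide(self, observation):
        # Placeholder policy: condition the decision on recent memory.
        context = self.memory.retrieve()
        return {"action": "track", "target": observation, "context_len": len(context)}

    def act(self, decision):
        # In a real system this would dispatch to an action-centric skill library.
        return f"executing {decision['action']} on {decision['target']}"

agent = ClosedLoopAgent(StreamingMemory(capacity=10))
for frame in ["frame-0", "frame-1", "frame-2"]:
    observation = agent.perceive(frame)
    decision = agent.decide(observation)
    result = agent.act(decision)

print(result)  # prints: executing track on frame-2
```

Because the memory is shared by reference, several agents could hold the same `StreamingMemory` instance, which is one plausible reading of the multi-agent shared-memory design.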
Problem

Research questions and friction points this paper is trying to address.

streaming video understanding · embodied intelligence · real-time reasoning · multimodal memory · proactive interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

streaming video understanding · embodied intelligence · multimodal memory · real-time reasoning · proactive interaction
Authors

Jiawei Chen
Zhe Chen
Chaoqun Du (Department of Automation, Tsinghua University)
Maokui He
Wei He
Hengtao Li
Qizhen Li
Zide Liu (Zhejiang University; Diffusion Models, Video Editing)
Hao Ma
Xuhao Pan
Chang Ren
Xudong Rao
Xintian Shen
Chenfeng Wang
Tao Wei
Chengjun Yu
Pengfei Yu (University of Illinois at Urbana-Champaign; Natural Language Processing, Machine Learning)
Shengyu Yao
Chunpeng Zhou
Kun Zhan
Lihao Zheng
Pan Zhou
Xuhan Zhu (UCAS; Computer Vision, Vision Language Model)
Yufei Zheng