Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception

📅 2026-03-13
🤖 AI Summary
This work addresses the inefficiencies in streaming multimodal perception—namely, cumulative latency, redundant information processing, and suboptimal computational utilization caused by per-frame execution. To mitigate these issues, the authors propose a lightweight perception scheduling framework that dynamically evaluates the relevance of individual perception modules based on scene context and outputs from preceding frames, activating only those deemed necessary. By integrating a relevance-driven mechanism with an information sparsity prior, the approach enables efficient context-aware scheduling. Experimental results demonstrate that, compared to conventional parallel pipelines, the proposed method reduces computational latency by up to 27.52%, improves MMPose activation recall by 72.73%, and achieves a keyframe accuracy of 98%.

📝 Abstract
In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real-time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework's capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.
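The scheduling idea described above—estimating each perception module's relevance from previous-frame outputs and activating only the necessary ones, with periodic keyframes for full resynchronization—can be illustrated with a minimal sketch. All names, scores, and thresholds here are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class RelevanceScheduler:
    """Hypothetical sketch of relevance-driven module scheduling."""
    threshold: float = 0.5        # relevance needed to activate a module
    keyframe_interval: int = 10   # force full activation every N frames
    frame_idx: int = 0

    def relevance(self, module: str, prev_outputs: dict) -> float:
        # Information sparsity prior: a module whose previous output
        # barely changed is unlikely to need recomputation this frame.
        # Unseen modules default to maximal relevance.
        return prev_outputs.get(module, {}).get("change_score", 1.0)

    def schedule(self, modules: list, prev_outputs: dict) -> list:
        self.frame_idx += 1
        # Keyframes run every module to resynchronize scene state.
        if self.frame_idx % self.keyframe_interval == 1:
            return list(modules)
        # Otherwise activate only modules whose estimated relevance
        # exceeds the threshold.
        return [m for m in modules
                if self.relevance(m, prev_outputs) >= self.threshold]


modules = ["pose", "audio", "object"]
prev = {"pose": {"change_score": 0.9}, "audio": {"change_score": 0.1}}
sched = RelevanceScheduler()
print(sched.schedule(modules, prev))  # frame 1 is a keyframe: all modules run
print(sched.schedule(modules, prev))  # frame 2: low-relevance "audio" is skipped
```

Skipping low-relevance modules between keyframes is what yields the latency savings the abstract reports, at the cost of occasionally missing a change until the next keyframe; the periodic full activation bounds that staleness.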
Problem

Research questions and friction points this paper is trying to address.

multimodal streaming perception
human-robot collaboration
perception latency
computational resource allocation
information redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

relevance-driven scheduling
multimodal streaming perception
perception latency reduction
context-aware module activation
human-robot collaboration
Dingcheng Huang
Mechatronics Research Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
Xiaotong Zhang
Mechatronics Research Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
Kamal Youcef-Toumi
Professor of Mechanical Engineering, MIT
Robotics and Control Systems