Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks

📅 2025-08-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address computational redundancy, high memory overhead, and system integration complexity in deploying multiple vision tasks on resource-constrained robotic platforms, this paper proposes an efficient modular visual perception engine. The engine adopts DINOv2 as a shared backbone; it is written in Python with ROS2 (Humble) C++ bindings and integrates CUDA Multi-Process Service (MPS) and TensorRT optimizations. It eliminates redundant computation and data transfers by reusing backbone features across task heads with zero-copy sharing. Furthermore, it introduces dynamic priority-based task scheduling and runtime adjustment of per-task inference frequencies to improve GPU utilization and system adaptability. Evaluated on the NVIDIA Jetson Orin AGX, the engine achieves end-to-end real-time inference at ≥50 Hz; compared to sequential execution, it delivers up to 3× speedup while maintaining a constant memory footprint. The design significantly improves multi-task coordination efficiency and deployment scalability for embedded robotic systems.

📝 Abstract
Deploying multiple machine learning models on resource-constrained robotic platforms for different perception tasks often results in redundant computations, large memory footprints, and complex integration challenges. In response, this work presents Visual Perception Engine (VPEngine), a modular framework designed to enable efficient GPU usage for visual multitasking while maintaining extensibility and developer accessibility. Our framework architecture leverages a shared foundation model backbone that extracts image representations, which are efficiently shared, without any unnecessary GPU-CPU memory transfers, across multiple specialized task-specific model heads running in parallel. This design eliminates the computational redundancy inherent in the feature extraction component when deploying traditional sequential models while enabling dynamic task prioritization based on application demands. We demonstrate our framework's capabilities through an example implementation using DINOv2 as the foundation model with multiple task heads (depth estimation, object detection, and semantic segmentation), achieving up to 3x speedup compared to sequential execution. Building on CUDA Multi-Process Service (MPS), VPEngine offers efficient GPU utilization and maintains a constant memory footprint while allowing per-task inference frequencies to be adjusted dynamically during runtime. The framework is written in Python and is open source with ROS2 C++ (Humble) bindings for ease of use by the robotics community across diverse robotic platforms. Our example implementation demonstrates end-to-end real-time performance at $\geq$50 Hz on NVIDIA Jetson Orin AGX for TensorRT optimized models.
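The shared-backbone pattern the abstract describes can be illustrated with a minimal sketch. This is not VPEngine's actual API; the class and the toy stand-in models below are hypothetical, chosen only to show that the backbone runs once per frame while every task head reuses its output, rather than each task re-running a full model sequentially.

```python
class SharedBackboneEngine:
    """Hypothetical sketch: one backbone pass per frame, features fanned
    out to all task heads (VPEngine's real interface will differ)."""

    def __init__(self, backbone, heads):
        self.backbone = backbone          # stand-in for a DINOv2 feature extractor
        self.heads = heads                # task-specific heads (depth, detection, ...)
        self.backbone_calls = 0           # counts redundant-work savings

    def infer(self, frame):
        self.backbone_calls += 1
        features = self.backbone(frame)   # extracted once, shared by every head
        return {name: head(features) for name, head in self.heads.items()}

# Toy stand-ins for the real models:
backbone = lambda frame: sum(frame)       # "features" = one shared value
heads = {
    "depth": lambda f: f * 2,
    "detection": lambda f: f + 1,
    "segmentation": lambda f: f - 1,
}

engine = SharedBackboneEngine(backbone, heads)
out = engine.infer([1, 2, 3])
print(out, engine.backbone_calls)  # three task outputs from a single backbone pass
```

With N heads, a sequential deployment would invoke N full models (N backbone passes) per frame; here `backbone_calls` stays at one regardless of how many heads are attached, which is the redundancy the paper eliminates.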
Problem

Research questions and friction points this paper is trying to address.

Reduces redundant computations in robotic vision tasks
Minimizes memory usage for multiple ML models
Simplifies integration of parallel task-specific model heads
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shared foundation model backbone for multitasking
Dynamic task prioritization via parallel heads
CUDA MPS for efficient GPU utilization
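The runtime adjustment of per-task inference frequencies mentioned above can be sketched as simple frame-skipping rate control. The class and parameter names here are assumptions for illustration, not VPEngine's scheduler: each head gets a period (run every N-th frame), so a low-priority head consumes proportionally less GPU time while the shared backbone still runs every frame.

```python
class RateLimitedHead:
    """Hedged sketch of per-task rate control (names are illustrative):
    a head runs only on frames that match its configured period."""

    def __init__(self, name, period_frames):
        self.name = name
        self.period_frames = period_frames  # run on every N-th frame
        self.runs = 0

    def maybe_run(self, frame_idx, features):
        if frame_idx % self.period_frames == 0:
            self.runs += 1
            return f"{self.name}({features})"  # stand-in for real inference
        return None                            # skipped to honor its rate budget

heads = [RateLimitedHead("depth", 1),          # every frame (high priority)
         RateLimitedHead("segmentation", 5)]   # every 5th frame (low priority)

for frame_idx in range(10):
    features = f"feat{frame_idx}"              # stand-in for shared backbone output
    for head in heads:
        head.maybe_run(frame_idx, features)

print(heads[0].runs, heads[1].runs)  # 10, 2
```

Because `period_frames` is just a field, it can be changed at runtime to reprioritize tasks on the fly, which is the adaptability the paper attributes to its dynamic scheduling.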