Scalable, Training-Free Visual Language Robotics: A Modular Multi-Model Framework for Consumer-Grade GPUs

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language-action (VLA) systems suffer from high training costs, poor generalization, limited zero-shot scalability, and difficulties in cross-platform transfer. To address these limitations, we propose a training-agnostic, plug-and-play modular VLA framework that leverages only lightweight open-source models—Mini-InternVL, CLIPSeg, Phi-3, and all-MiniLM—to enable real-time, end-to-end mapping from natural language instructions to robot action sequences on consumer-grade GPUs (e.g., RTX 2070 Mobile). Crucially, the framework requires no fine-tuning or retraining, supporting zero-shot task generalization and rapid adaptation across heterogeneous robotic platforms. Experimental results demonstrate robust performance in unseen environments, successfully executing natural language–guided pick-and-place tasks without domain-specific supervision. This approach significantly lowers deployment barriers while enhancing system flexibility, modularity, and scalability—enabling accessible, low-cost VLA deployment beyond specialized hardware or large-scale training infrastructures.
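
To make the modular, training-free pipeline concrete, here is a minimal sketch of its zero-shot grounding step: the VLM names the objects it sees, and CLIPSeg turns each name into a pixel mask without any task-specific training. Only the public CLIPSeg checkpoint and its transformers API are real here; the segment_objects wrapper, the 0.5 threshold, and the example prompts are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of SVLR-style zero-shot object grounding with CLIPSeg.
# The checkpoint and transformers API are real; the wrapper, threshold, and
# prompts are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def segment_objects(image: Image.Image, object_names: list[str]) -> dict[str, torch.Tensor]:
    """Return one binary mask per object name, with no task-specific training."""
    inputs = processor(
        text=object_names,
        images=[image] * len(object_names),
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits  # (N, H, W): one heatmap per prompt
    if logits.dim() == 2:                # a single prompt returns (H, W)
        logits = logits.unsqueeze(0)
    probs = torch.sigmoid(logits)
    return {name: probs[i] > 0.5 for i, name in enumerate(object_names)}

# e.g. masks = segment_objects(Image.open("scene.png"), ["red cube", "bowl"])
```

The masks can then parameterize a task primitive (e.g., the grasp target of a pick-and-place action), which is how a zero-shot perception model slots into a training-free control pipeline.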

📝 Abstract
The integration of language instructions with robotic control, particularly through Vision Language Action (VLA) models, has shown significant potential. However, these systems are often hindered by high computational costs, the need for extensive retraining, and limited scalability, making them less accessible for widespread use. In this paper, we introduce SVLR (Scalable Visual Language Robotics), an open-source, modular framework that operates without the need for retraining, providing a scalable solution for robotic control. SVLR leverages a combination of lightweight, open-source AI models, including the Vision-Language Model (VLM) Mini-InternVL, the zero-shot image segmentation model CLIPSeg, the Large Language Model (LLM) Phi-3, and the sentence-similarity model all-MiniLM, to process visual and language inputs. These models work together to identify objects in an unknown environment, use them as parameters for task execution, and generate a sequence of actions in response to natural language instructions. A key strength of SVLR is its scalability. The framework allows new robotic tasks and robots to be integrated simply by adding text descriptions and task definitions, without any retraining. This modularity ensures that SVLR can continuously adapt to the latest advancements in AI technologies and support a wide range of robots and tasks. SVLR operates effectively on an NVIDIA RTX 2070 (mobile) GPU, demonstrating promising performance in executing pick-and-place tasks. While these initial results are encouraging, further evaluation across a broader set of tasks and comparisons with existing VLA models are needed to assess SVLR's generalization capabilities and performance in more complex scenarios.
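
The text-only extensibility described above is straightforward to sketch: each task is registered as a short natural-language description, and all-MiniLM embeddings pick the closest registered task for an incoming instruction, so adding a task means adding a string rather than retraining anything. The sentence-transformers calls below are real; the task_registry contents and the match_task helper are hypothetical illustrations, not taken from the paper.

```python
# Minimal sketch of training-free task selection via sentence similarity.
# Registry entries and helper names are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Adding a new task = adding a description string; no retraining anywhere.
task_registry = {
    "pick_and_place": "pick up an object and put it down at a target location",
    "open_gripper": "open the gripper to release the currently held object",
}

def match_task(instruction: str) -> str:
    """Return the registered task whose description is closest to the instruction."""
    names = list(task_registry)
    desc_emb = encoder.encode([task_registry[n] for n in names], convert_to_tensor=True)
    instr_emb = encoder.encode(instruction, convert_to_tensor=True)
    scores = util.cos_sim(instr_emb, desc_emb)[0]  # similarity to every task
    return names[int(scores.argmax())]

# match_task("move the red cube into the bowl")  -> "pick_and_place"
```

In the same spirit, upgrading to a newer embedding model would only mean changing the checkpoint name, which is the kind of drop-in modularity the abstract claims.
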
Problem

Research questions and friction points this paper is trying to address.

Robotics
Adaptability
Cost-effectiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular Framework
Cost-effective Scalability
Adaptive Flexibility
Marie Samson
CNRS-AIST JRL (Joint Robotics Laboratory), National Institute of Advanced Industrial Science and Technology (AIST), Japan
Bastien Muraccioli
AIST-CNRS JRL
Fumio Kanehiro
National Institute of Advanced Industrial Science and Technology (AIST)