BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses key challenges in automating biological experiments: unstructured protocols, difficulty recognizing transparent or reflective labware, and the lack of state awareness in multi-step procedures. It proposes a protocol-centric embodied multi-agent framework that integrates protocol parsing, vision-based state verification, and action execution into a closed-loop workflow, aiming for flexible laboratory automation without reliance on expensive hardware or fixed pipelines. Core components include a tailored LLM-based protocol agent, a VLM-RAG agent for visual state verification, a lightweight VLA policy for execution, and AugSmolVLA, an online visual augmentation method designed to mitigate wet-lab visual interference. On a benchmark comprising 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, AugSmolVLA improves execution stability over ACT, X-VLA, and the original SmolVLA under both normal and high-exposure imaging conditions, particularly for precise placement and transparent-object manipulation.
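To make the idea of online visual augmentation concrete, here is a minimal sketch, not the AugSmolVLA method itself. It assumes that "online" means a fresh perturbation is sampled each time a training frame is drawn, and it models only two of the disturbances mentioned: overexposure (a random brightness gain) and a specular reflection (a small saturated patch). The function name `augment_frame` and the parameters `max_gain`, `glare_prob`, and `glare_size` are hypothetical; the paper's actual transforms and settings are not given in this summary.

```python
import random


def augment_frame(frame, rng, max_gain=2.0, glare_prob=0.5, glare_size=2):
    """Return a perturbed copy of a grayscale frame (hypothetical sketch).

    frame: 2D list of intensities in [0.0, 1.0]. The input is not modified.
    """
    # A gain >= 1 brightens the image; clipping at 1.0 mimics sensor saturation.
    gain = rng.uniform(1.0, max_gain)
    out = [[min(1.0, px * gain) for px in row] for row in frame]

    # With some probability, paint a saturated patch to mimic a specular highlight.
    if rng.random() < glare_prob:
        height, width = len(out), len(out[0])
        top = rng.randrange(max(1, height - glare_size + 1))
        left = rng.randrange(max(1, width - glare_size + 1))
        for r in range(top, min(height, top + glare_size)):
            for c in range(left, min(width, left + glare_size)):
                out[r][c] = 1.0
    return out


# Each call samples a new perturbation, as a data loader would per training step.
rng = random.Random(0)
frame = [[0.2, 0.4, 0.6, 0.8] for _ in range(4)]
augmented = augment_frame(frame, rng)
```

Because the perturbation is resampled on every call, the policy would see a different exposure level and glare position for the same underlying demonstration frame, which is the general motivation for augmenting online rather than precomputing a fixed augmented dataset.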

📝 Abstract

Biological laboratory automation can reduce repetitive manual work and improve reproducibility, but reliable embodied execution in wet-lab environments remains challenging. Protocols are often unstructured, labware is frequently transparent or reflective, and multi-step procedures require state-aware execution beyond one-shot instruction following. Existing robotic systems often rely on costly hardware, fixed workflows, dedicated instruments, or robotics-oriented interfaces. Here, we introduce BioProVLA-Agent, an affordable, protocol-driven, vision-enhanced embodied multi-agent system enabled by Vision-Language-Action (VLA) models for biological manipulation. The system uses protocols as the task interface and integrates protocol parsing, visual state verification, and embodied execution in a closed-loop workflow. A Tailored LLM Protocol Agent converts protocols into verifiable subtasks; a VLM-RAG Verification Agent assesses readiness and completion using observations, robot states, retrieved knowledge, and success/failure examples; and a VLA Embodied Agent executes verified subtasks through a lightweight policy. To improve robustness under wet-lab visual perturbations, we develop AugSmolVLA, an online augmentation strategy targeting transparent labware, reflections, illumination shifts, and overexposure. We evaluate the system on a hierarchical benchmark covering 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, including tube loading, sorting, waste disposal, cap twisting, and liquid pouring. Across normal and high-exposure settings, AugSmolVLA improves execution stability over ACT, X-VLA, and the original SmolVLA, especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes. These results suggest a practical route toward accessible, protocol-centered, and verification-capable embodied AI for biological manipulation.
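The closed-loop workflow described in the abstract can be sketched as a simple control loop: parse the protocol into subtasks, check readiness, execute, then check completion. This is an illustrative sketch only. The three agents are replaced by plain callables, the line-per-subtask parser is a stand-in for the LLM protocol agent, and the retry-on-failure behavior (`max_attempts`) is an assumption that the abstract does not state. All names (`Subtask`, `parse_protocol`, `run_protocol`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Subtask:
    name: str


def parse_protocol(text: str) -> List[Subtask]:
    """Stand-in for the protocol agent: one subtask per non-empty line."""
    return [Subtask(line.strip()) for line in text.splitlines() if line.strip()]


def run_protocol(
    text: str,
    is_ready: Callable[[Subtask], bool],
    execute: Callable[[Subtask], None],
    is_complete: Callable[[Subtask], bool],
    max_attempts: int = 2,
) -> List[Tuple[str, str]]:
    """Run subtasks in order, verifying before and after each execution."""
    log = []
    for sub in parse_protocol(text):
        # Readiness check (verification agent stand-in) gates execution.
        if not is_ready(sub):
            log.append((sub.name, "not_ready"))
            break
        for _ in range(max_attempts):
            execute(sub)  # embodied agent stand-in
            if is_complete(sub):  # completion check closes the loop
                log.append((sub.name, "done"))
                break
        else:
            # No attempt was verified as complete: stop the workflow.
            log.append((sub.name, "failed"))
            break
    return log


# Toy simulation: "twist cap" only succeeds on its second attempt.
attempts = {}


def execute(sub: Subtask) -> None:
    attempts[sub.name] = attempts.get(sub.name, 0) + 1


def is_ready(sub: Subtask) -> bool:
    return True


def is_complete(sub: Subtask) -> bool:
    needed = 2 if sub.name == "twist cap" else 1
    return attempts[sub.name] >= needed


log = run_protocol("load tube\ntwist cap\n", is_ready, execute, is_complete)
```

In this toy run the second subtask is retried once before it is verified as complete. The point of the structure is that a later subtask never starts from an unverified state, which is what distinguishes state-aware execution from one-shot instruction following.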

Problem

Research questions and friction points this paper is trying to address.

biological laboratory automation

embodied AI

vision-language-action

protocol-driven execution

wet-lab manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)

protocol-driven automation

closed-loop verification