Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the fragility of small vision-language models in visual perception and tool coordination, deficits that are typically mitigated through costly supervised trajectory fine-tuning. The authors propose SPECTRA, a supervision-free framework that uses cold-start reinforcement learning to bootstrap agentic capabilities without human preference labels. SPECTRA imposes Soft Structured Multi-turn Rollouts, a topological constraint that directs the agent to explicitly sequence tool-derived evidence before synthesizing an answer. Training uses a multi-objective reward covering task correctness, rollout structure, and tool utility, and the authors introduce a new metric, Tool Instrumental Utility (TIU), to quantify tool efficacy without ground truth. Evaluated on composite and out-of-distribution (MMMU-Pro) benchmarks, SPECTRA improves task accuracy by up to 5% and tool efficiency by 9%.

📝 Abstract

Small Vision-Language Models (SVLMs) are efficient task controllers but often suffer from visual brittleness and poor tool orchestration. They typically require expensive supervised trajectory tuning to mitigate these deficits. In this work, we propose Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment (SPECTRA), a supervision-free framework that bootstraps agentic capabilities via Cold-start Reinforcement Learning for SVLMs. SPECTRA enforces Soft Structured Multi-turn Rollouts, a topological constraint that directs agents to explicitly sequence tool-derived evidence before synthesis, effectively grounding reasoning in visual observations. We employ a multi-objective reward signal that simultaneously maximizes task correctness, rollout structure, and tool utility, enabling the agent to self-discover robust behaviors without human preference labels. We further introduce Tool Instrumental Utility (TIU), a novel metric to quantify tool efficacy in the absence of ground truth. Extensive evaluations across composite and out-of-distribution (MMMU-Pro) benchmarks demonstrate that SPECTRA boosts agentic trajectories, improving task accuracy by up to 5% and tool efficiency by 9%, enabling more efficient multimodal agents that learn effectively from environmental interaction alone.
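The abstract does not give the reward formula, so the following is only a minimal sketch of how a multi-objective reward over task correctness, rollout structure, and tool utility might be combined. Everything in it is an assumption for illustration: the `Turn` type, the `structure_score` rule (tool evidence must precede a single final answer), the weights, and the use of a precomputed `tool_utility` score in [0, 1] as a stand-in for TIU, whose actual definition is not stated here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    # Hypothetical turn label: "tool" for a tool call/observation,
    # "answer" for the final synthesis step.
    kind: str


def structure_score(turns: List[Turn]) -> float:
    """Soft structural check (illustrative, not the paper's rule).

    1.0: at least one tool turn precedes a single final answer.
    0.5: a single final answer with no tool evidence before it.
    0.0: no final answer, or an answer appears before the last turn.
    """
    if not turns or turns[-1].kind != "answer":
        return 0.0
    if any(t.kind == "answer" for t in turns[:-1]):
        return 0.0
    return 1.0 if any(t.kind == "tool" for t in turns[:-1]) else 0.5


def combined_reward(
    correct: bool,
    turns: List[Turn],
    tool_utility: float,
    weights: Tuple[float, float, float] = (0.6, 0.2, 0.2),  # assumed weights
) -> float:
    """Weighted sum of the three objectives named in the abstract."""
    w_task, w_struct, w_tool = weights
    # Clamp the assumed utility score into [0, 1].
    tool_utility = min(max(tool_utility, 0.0), 1.0)
    return (
        w_task * float(correct)
        + w_struct * structure_score(turns)
        + w_tool * tool_utility
    )


rollout = [Turn("tool"), Turn("tool"), Turn("answer")]
reward = combined_reward(correct=True, turns=rollout, tool_utility=0.5)
```

The partial credit in `structure_score` is one way to read "soft": a rollout that skips tool evidence is penalized rather than rejected outright. The paper may realize the constraint differently.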

Problem

Research questions and friction points this paper is trying to address.

Small Vision-Language Models

visual brittleness

tool orchestration

supervision-free learning

agentic trajectories

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cold-start Reinforcement Learning

Supervision-Free Agentic Trajectories

Soft Structured Multi-turn Rollouts