Guideline-Consistent Segmentation via Multi-Agent Refinement

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Semantic segmentation in real-world scenarios demands strict adherence to lengthy, fine-grained textual annotation guidelines, yet existing methods suffer from substantial annotation bias, high retraining costs, and the inability of open-vocabulary models to parse paragraph-level rules. To address this, we propose a fine-tuning-free multi-agent iterative optimization framework: a Worker agent performs segmentation using a general-purpose vision-language model; a Supervisor agent retrieves relevant annotation guidelines and provides structured, critique-based feedback; and a lightweight reinforcement learning–based termination policy dynamically controls iteration depth. Our framework achieves, for the first time, zero-shot, fine-grained compliance with long-text guidelines, supporting guideline evolution and cross-task generalization. Extensive experiments on Waymo and ReasonSeg demonstrate significant improvements over state-of-the-art methods, validating its high accuracy, robustness, and strong generalization capability.

📝 Abstract
Semantic segmentation in real-world applications often requires not only accurate masks but also strict adherence to textual labeling guidelines. These guidelines are typically complex and long, and both human and automated labeling often fail to follow them faithfully. Traditional approaches depend on expensive task-specific retraining that must be repeated as the guidelines evolve. Although recent open-vocabulary segmentation methods excel with simple prompts, they often fail when confronted with sets of paragraph-length guidelines that specify intricate segmentation rules. To address this, we introduce a multi-agent, training-free framework that coordinates general-purpose vision-language models within an iterative Worker-Supervisor refinement architecture. The Worker performs the segmentation, the Supervisor critiques it against the retrieved guidelines, and a lightweight reinforcement learning stop policy decides when to terminate the loop, ensuring guideline-consistent masks while balancing resource use. Evaluated on the Waymo and ReasonSeg datasets, our method notably outperforms state-of-the-art baselines, demonstrating strong generalization and instruction adherence.
Problem

Research questions and friction points this paper is trying to address.

Ensuring semantic segmentation follows complex textual guidelines
Overcoming limitations of traditional task-specific retraining approaches
Addressing failure of open-vocabulary methods with paragraph-length rules
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework coordinates vision-language models
Worker-Supervisor architecture iteratively refines segmentation
Lightweight reinforcement learning stops refinement loop
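
The Worker-Supervisor refinement loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `worker_segment`, `supervisor_critique`, and `should_stop` are hypothetical stand-ins (the real Worker and Supervisor are vision-language model calls, and the real stop policy is learned with reinforcement learning). Here compliance is stubbed as a count of satisfied guideline rules so the control flow is runnable.

```python
def worker_segment(image, guidelines, feedback=None):
    """Hypothetical Worker: in the paper, a VLM produces a candidate mask.
    Stubbed: each round of feedback lets one more guideline rule be satisfied."""
    satisfied = feedback["satisfied"] + 1 if feedback else 1
    return {"satisfied": min(satisfied, len(guidelines))}

def supervisor_critique(mask, guidelines):
    """Hypothetical Supervisor: retrieves relevant rules and critiques the mask.
    Returns structured feedback with a compliance score in [0, 1]."""
    return {"satisfied": mask["satisfied"],
            "score": mask["satisfied"] / len(guidelines)}

def should_stop(score, step, max_steps=5, threshold=0.99):
    """Stand-in for the learned RL termination policy: halt when the mask is
    guideline-compliant or the iteration budget is exhausted."""
    return score >= threshold or step + 1 >= max_steps

def refine(image, guidelines, max_steps=5):
    """Iterative Worker-Supervisor loop with dynamic termination."""
    feedback = None
    for step in range(max_steps):
        mask = worker_segment(image, guidelines, feedback)
        feedback = supervisor_critique(mask, guidelines)
        if should_stop(feedback["score"], step, max_steps):
            break
    return mask, feedback["score"], step + 1
```

With three stubbed rules, the loop converges in three iterations; the point of the learned stop policy is to trade extra refinement rounds against compute, terminating early once the Supervisor's critique indicates compliance.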
Vanshika Vats
Doctoral Student at University of California, Santa Cruz
Computer Vision · Machine Learning · Deep Learning · Vision for Autonomous Vehicles
Ashwani Rathee
University of California, Santa Cruz
James Davis
University of California, Santa Cruz