Re-Aligning Language to Visual Objects with an Agentic Workflow

📅 2025-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Language-based object detection (LOD) suffers from hallucinations in vision-language model (VLM)-generated descriptions, such as erroneous object names, colors, or shapes, leading to misalignment between visual and linguistic representations. To address this, the authors propose Real-LOD, an LLM-driven closed-loop agent that dynamically refines multimodal prompts via a three-stage cycle of planning, tool invocation (VLM-based re-description), and reflection, thereby realigning language expressions with object attributes. The key contribution is a neural-symbolic, iterative vision-language (VL) realignment workflow, whose output is a compact, high-quality training set of only 0.18M samples that balances data efficiency with alignment fidelity. On standard LOD benchmarks, a model trained on this data outperforms prior methods by approximately 50%, demonstrating that high-fidelity cross-modal alignment is critical for boosting detection performance.

📝 Abstract
Language-based object detection (LOD) aims to align visual objects with language expressions. A large amount of paired data is utilized to improve LOD model generalization. During training, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, facilitating the scaling up of training data. In this process, we observe that VLM hallucinations introduce inaccurate object descriptions (e.g., object name, color, and shape) that deteriorate VL alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by an LLM to re-align language to visual objects via adaptively adjusting image and text prompts. We name this workflow Real-LOD, which includes planning, tool use, and reflection steps. Given an image with detected objects and raw VLM language expressions, Real-LOD automatically reasons about its state and arranges actions based on our neural-symbolic designs (i.e., planning). The action adaptively adjusts the image and text prompts and sends them to VLMs for object re-description (i.e., tool use). Then, we use another LLM to analyze these refined expressions for feedback (i.e., reflection). These steps are conducted in a cyclic form to gradually improve language descriptions for re-alignment to visual objects. We construct a dataset containing only 0.18M images with re-aligned language expressions and train a prevailing LOD model that surpasses existing LOD methods by around 50% on standard benchmarks. Our Real-LOD workflow, with automatic VL refinement, reveals a potential to preserve data quality while scaling up data quantity, which further improves LOD performance from a data-alignment perspective.
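The cyclic planning / tool-use / reflection loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all function names are hypothetical, and the LLM planner, the VLM re-description call, and the LLM reflector are replaced by simple rule-based stubs so the control flow is runnable.

```python
# Hypothetical sketch of the Real-LOD cycle: planning -> tool use
# (VLM re-description with adjusted prompts) -> reflection, repeated
# until the expression is judged aligned. All names are illustrative;
# the real system drives these steps with LLMs and a VLM.

def plan(state):
    """Planning (stub): pick the next prompt adjustment from feedback."""
    if state["feedback"] == "wrong color":
        # e.g. crop tighter around the object and ask about color explicitly
        return {"image_prompt": "zoom_to_object",
                "text_prompt": "describe the object's color carefully"}
    return {"image_prompt": "full_image", "text_prompt": "describe the object"}

def vlm_redescribe(action, obj):
    """Tool use (stub): query the 'VLM' with the adjusted prompts."""
    if action["image_prompt"] == "zoom_to_object":
        return f"a {obj['color']} {obj['name']}"   # zoomed view recovers the true color
    return f"a red {obj['name']}"                  # simulated hallucination on the full image

def reflect(expression, obj):
    """Reflection (stub): check the expression against object attributes."""
    return "aligned" if obj["color"] in expression else "wrong color"

def real_lod_cycle(obj, max_rounds=3):
    """Run the cycle until reflection reports alignment or rounds run out."""
    state = {"feedback": None}
    expression = None
    for _ in range(max_rounds):
        action = plan(state)                          # planning
        expression = vlm_redescribe(action, obj)      # tool use
        state["feedback"] = reflect(expression, obj)  # reflection
        if state["feedback"] == "aligned":
            break
    return expression

print(real_lod_cycle({"name": "car", "color": "blue"}))  # → "a blue car"
```

In this toy run, the first round hallucinates "a red car", reflection flags the color, planning switches to a zoomed image prompt, and the second round produces the aligned expression.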
Problem

Research questions and friction points this paper is trying to address.

Reducing VLM hallucinations in object descriptions
Re-aligning language to visual objects adaptively
Improving LOD model performance via data refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-controlled agentic workflow for realignment
Adaptive image and text prompt adjustment
Cyclic planning, tool use, and reflection steps