No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers

📅 2025-12-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Visual reasoning demands precise object localization and an understanding of complex spatial relationships. Existing approaches either rely on large-scale (image, query, answer) supervision or use training-free program synthesis, which suffers from flawed logic and inaccurate grounding. This paper introduces an annotation-free training framework built on two AI verifiers: an LLM verifier refines the LLM's reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining. The framework pairs language-only reasoning models, which decompose spatial queries into simpler subtasks, with vision specialist models improved by VLM critics. It requires no ground-truth labels. Evaluated across diverse spatial reasoning tasks, the method surpasses open-source and proprietary models, and with its improved visual grounding model it also outperforms recent text-only visual reasoning methods.

📝 Abstract

Visual reasoning is challenging, requiring both precise object grounding and an understanding of complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches, which use pre-trained models and avoid training but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground-truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks and show that our method improves visual reasoning and surpasses open-source and proprietary models. With our improved visual grounding model, we further outperform recent text-only visual reasoning methods. Project webpage: https://glab-caltech.github.io/valor/

Problem

Research questions and friction points this paper is trying to address.

Develop annotation-free training for visual reasoning

Enhance reasoning and grounding with AI verifiers

Improve performance across spatial reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Annotation-free training with AI-powered multimodal verifiers

LLM verifier refines reasoning via reinforcement learning

VLM verifier strengthens grounding via hard-negative mining
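
To make the hard-negative mining idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Detection` type, the `mine_pseudo_labels` function, the 0.5 threshold, and the toy verifier are all hypothetical names and values chosen for illustration. It assumes a verifier that can accept or reject a candidate detection, and it treats confident detections the verifier rejects as hard negatives.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    box: Box
    label: str
    score: float  # detector confidence in [0, 1]


def mine_pseudo_labels(
    detections: List[Detection],
    verifier: Callable[[Detection], bool],
    hard_negative_threshold: float = 0.5,  # assumed value, not from the paper
) -> Tuple[List[Detection], List[Detection]]:
    """Split detections into verifier-accepted positives and hard negatives.

    A hard negative here is a detection the detector scored confidently
    but the verifier rejected. Low-confidence rejections are discarded.
    """
    positives: List[Detection] = []
    hard_negatives: List[Detection] = []
    for det in detections:
        if verifier(det):
            positives.append(det)
        elif det.score >= hard_negative_threshold:
            hard_negatives.append(det)
    return positives, hard_negatives


def toy_verifier(det: Detection) -> bool:
    """Stand-in for a VLM critic: pretends the image contains no cats."""
    return det.label != "cat"


detections = [
    Detection((10, 10, 50, 50), "dog", 0.92),  # accepted by the verifier
    Detection((60, 20, 90, 70), "cat", 0.81),  # confident but rejected
    Detection((5, 80, 30, 95), "cat", 0.12),   # rejected, low confidence
]
positives, hard_negatives = mine_pseudo_labels(detections, toy_verifier)
```

In a real pipeline the verifier would be a call to a VLM that inspects the image crop, and the two returned lists would serve as training signal for the grounding model without any human labels.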

🔎 Similar Papers

Pre-trained Vision-Language Models Learn Discoverable Visual Concepts