Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Frontier AI models are widely reported to struggle with compositional reasoning, but the paper argues that mainstream evaluation metrics systematically underestimate their true capability. It introduces a group matching score that better exploits the group structure of benchmarks such as Winoground, revealing substantial hidden capability in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs); simply overfitting to the induced matchings at test time converts this hidden capability into gains under standard metrics. Building on this insight, the authors propose Test-Time Matching (TTM), an iterative, self-improving inference-time algorithm that bootstraps performance without any external supervision. Across 16 dataset variants TTM delivers consistent gains: it enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, yields the first result in which GPT-4.1 exceeds estimated human performance on Winoground, and achieves relative improvements of up to 85.7% on challenging datasets such as WhatsUp, establishing new state-of-the-art results on multiple benchmarks.

📝 Abstract
Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To address this, we introduce a group matching score that better exploits group structure and reveals substantial hidden capability in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden capability into higher scores under standard evaluation metrics, closing much of the reported gap. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
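The group matching score described in the abstract can be illustrated with a toy sketch. In a Winoground-style group (n captions, n images, with caption i paired to image i), the standard group score demands that every correct pair beat both of its within-group distractors, while a matching-based score only asks that the correct assignment have the highest total similarity among all caption-to-image permutations. The exact scoring rule in the paper may differ; this is our reading of the idea.

```python
from itertools import permutations


def standard_group_score(sim):
    """Winoground-style group score: every correct pair sim[i][i] must beat
    all within-group distractors, both text-side and image-side."""
    n = len(sim)
    text_ok = all(sim[i][i] > sim[i][j] for i in range(n) for j in range(n) if j != i)
    image_ok = all(sim[i][i] > sim[j][i] for i in range(n) for j in range(n) if j != i)
    return text_ok and image_ok


def group_matching_score(sim):
    """Matching-based score (illustrative): credit the model when the correct
    (identity) assignment has the highest total similarity among all
    caption-to-image permutations."""
    n = len(sim)
    best = max(permutations(range(n)),
               key=lambda p: sum(sim[i][p[i]] for i in range(n)))
    return best == tuple(range(n))
```

For example, with similarities `[[0.9, 0.95], [0.1, 0.8]]` the standard score fails (caption 0 prefers image 1), yet the identity matching is still globally best, so the matching score succeeds. This is the kind of "hidden capability" the stricter metric misses.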
Problem

Research questions and friction points this paper is trying to address.

Addressing underestimation of compositional reasoning in multimodal AI models
Improving evaluation metrics to reveal hidden model capabilities
Developing test-time algorithms to boost performance without supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group matching score exploits group structure
Test-time matching iteratively self-improves performance
Algorithm boosts compositional reasoning without supervision
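The iterative self-improvement described above can be sketched as a loop that alternates between inducing the best matching per group and adapting the scores toward those matchings as pseudo-labels. Everything here is a hypothetical skeleton: `reinforce` is an illustrative stand-in for the paper's actual test-time adaptation step, not the authors' implementation.

```python
from itertools import permutations


def best_matching(sim):
    """Highest-total-similarity assignment of captions to images in one group."""
    n = len(sim)
    return max(permutations(range(n)),
               key=lambda p: sum(sim[i][p[i]] for i in range(n)))


def ttm_iterate(groups, update, n_iters=3):
    """Hypothetical TTM skeleton: induce matchings (pseudo-labels), let
    `update` adapt the per-group similarity matrices toward them, repeat.
    No external supervision is used at any point."""
    for _ in range(n_iters):
        matchings = [best_matching(g) for g in groups]
        groups = update(groups, matchings)
    return [best_matching(g) for g in groups]


def reinforce(groups, matchings, step=0.1):
    """Toy update: boost the matched cells, mimicking 'overfitting to the
    induced matchings' at test time (illustrative only)."""
    out = []
    for g, m in zip(groups, matchings):
        g = [row[:] for row in g]
        for i, j in enumerate(m):
            g[i][j] += step
        out.append(g)
    return out
```

In the real method the update step would adapt the model (or its scoring head) rather than edit a score table, but the control flow (match, fit to matchings, re-match) follows the abstract's description.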
Yinglun Zhu
Assistant Professor of ECE/CSE at the University of California, Riverside
Active Learning · RL and Bandits · Large Language Models · Multimodal Models
Jiancheng Zhang
University of California, Riverside
Fuzhi Tang
University of California, Riverside