Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited compositional reasoning—e.g., distinguishing “a dog chasing a cat” from “a cat chasing a dog”—and still fall well short of human performance on benchmarks such as Winoground. To address this, the authors propose SCRAMBLe, a fully automated framework that synthesizes binary preference data from existing image-caption pairs without human annotation, and uses preference learning (e.g., DPO) to fine-tune open-weight MLLMs such as Molmo-7B for compositional reasoning. Experiments show that a SCRAMBLe-tuned Molmo-7B reaches 54.8% Winoground accuracy, the best reported to date and a +5.3 percentage point improvement, while preserving or slightly improving (+~1%) general VQA performance. All code, tuned models, and the synthetic training dataset are publicly released.

📝 Abstract
Compositionality, or correctly recognizing scenes as compositions of atomic visual concepts, remains difficult for multimodal large language models (MLLMs). Even state-of-the-art MLLMs such as GPT-4o can make mistakes in distinguishing compositions like "dog chasing cat" vs. "cat chasing dog". While MLLMs have made significant progress on Winoground, a benchmark for measuring such reasoning, they are still far from human performance. We show that compositional reasoning in these models can be improved by elucidating such concepts via data, where a model is trained to prefer the correct caption for an image over a close but incorrect one. We introduce SCRAMBLe: Synthetic Compositional Reasoning Augmentation of MLLMs with Binary preference Learning, an approach for preference-tuning open-weight MLLMs on synthetic preference data generated in a fully automated manner from existing image-caption data. SCRAMBLe holistically improves these MLLMs' compositional reasoning capabilities, as seen through significant gains across multiple vision-language compositionality benchmarks, as well as smaller but significant improvements on general question answering tasks. As a sneak peek, the SCRAMBLe-tuned Molmo-7B model improves on Winoground from 49.5% to 54.8% (best reported to date), while improving by ~1% on more general visual question answering tasks. Code for SCRAMBLe along with tuned models and our synthetic training dataset is available at https://github.com/samarth4149/SCRAMBLe.
Problem

Research questions and friction points this paper is trying to address.

Improving compositional reasoning in vision-language models
Enhancing model accuracy in distinguishing complex visual scenes
Boosting performance on vision-language benchmarks with synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic preference data for MLLMs
Automated binary preference learning
Improved performance on compositional reasoning benchmarks
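The binary preference learning at the heart of SCRAMBLe can be sketched as a DPO-style loss over (image, correct caption, perturbed caption) triples: the tuned policy is pushed to assign a higher likelihood to the correct caption than a frozen reference model does, relative to the incorrect one. The sketch below is illustrative only, assuming a scalar summed-log-probability interface; the function name and example values are hypothetical, not the paper's actual implementation.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (image, correct caption, scrambled caption) triple.

    Inputs are summed token log-probabilities of the correct ("chosen") and
    perturbed ("rejected") captions under the tuned policy and a frozen
    reference model; beta controls deviation from the reference.
    Loss = -log(sigmoid(margin)).
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # Numerically stable -log(sigmoid(margin)) = log(1 + exp(-margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical values: the policy now prefers the correct caption more
# strongly than the reference does, so the loss falls below log(2).
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0, beta=0.1)
```

When the policy and reference assign identical log-probabilities the margin is zero and the loss equals log(2); training reduces the loss by widening the chosen-vs-rejected gap beyond the reference model's.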