🤖 AI Summary
This work tackles Rebus puzzles, a challenging multimodal semantic reasoning task for vision-language models (VLMs) that integrates images, symbols, and text. It introduces RebusBench, the first large-scale English benchmark for the task, comprising 1,333 puzzles that span 18 thematic categories, diverse artistic styles, and multiple difficulty levels. To solve these puzzles, the authors propose RebusDescProgICE, a framework that unifies unstructured natural-language descriptions with code-based structured reasoning, augmented by a reasoning-guided in-context example selection mechanism. The framework combines multimodal modeling, chain-of-thought prompting, and program-aided reasoning in a model-agnostic way, making it compatible with both proprietary and open-source VLMs. Experiments show consistent gains over chain-of-thought baselines: +2.1–4.1% accuracy with closed-source VLMs and +20–30% with open-source VLMs. The approach offers a new paradigm for interpreting complex image-text puns and performing multi-step commonsense reasoning.
📝 Abstract
Understanding Rebus Puzzles, which use pictures, symbols, and letters to represent words or phrases creatively, requires a variety of skills such as image recognition, commonsense reasoning, multi-step reasoning, and image-based wordplay, making this a challenging task even for current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse benchmark of $1{,}333$ English Rebus Puzzles covering different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, and entertainment. We also propose RebusDescProgICE, a model-agnostic framework that combines an unstructured description with code-based, structured reasoning, along with improved, reasoning-based in-context example selection, improving the performance of Vision-Language Models on $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$ by $2.1$–$4.1\%$ with closed-source models and by $20$–$30\%$ with open-source models compared to Chain-of-Thought reasoning.
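The two components named above, a hybrid description-plus-program prompt and reasoning-guided in-context example (ICE) selection, can be sketched roughly as follows. This is a minimal illustration only: the function names, the lexical-overlap similarity, and the prompt layout are assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Example:
    puzzle: str     # short identifier or description of a solved puzzle
    reasoning: str  # stored reasoning trace for that puzzle
    answer: str     # gold answer phrase

def similarity(a: str, b: str) -> float:
    # Toy word-overlap (Jaccard) score; a real selector would compare
    # model-generated reasoning traces with a learned or embedding-based metric.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def select_examples(query_reasoning: str, pool: list[Example], k: int = 2) -> list[Example]:
    # Reasoning-guided ICE: rank candidate examples by how closely their
    # stored reasoning matches a draft reasoning for the query puzzle.
    return sorted(pool, key=lambda e: similarity(query_reasoning, e.reasoning),
                  reverse=True)[:k]

def build_prompt(description: str, examples: list[Example]) -> str:
    # Hybrid prompt: few-shot examples, an unstructured image description,
    # and a code-like scaffold that structures the multi-step reasoning.
    shots = "\n".join(f"Puzzle: {e.puzzle}\nReasoning: {e.reasoning}\nAnswer: {e.answer}"
                      for e in examples)
    program = ("def solve():\n"
               "    parts = decompose(description)  # symbols, letters, layout\n"
               "    return combine(parts)           # merge parts into a phrase\n")
    return f"{shots}\n\nImage description: {description}\n{program}\nAnswer:"
```

The selected examples and assembled prompt would then be sent to the chosen VLM; because the scaffold lives entirely in the prompt, the same pipeline works for closed-source and open-source models alike.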