🤖 AI Summary
This work tackles Rebus puzzles, a challenging multimodal semantic reasoning task for vision-language models (VLMs) that integrates images, symbols, and text. It introduces RebusBench, the first large-scale English benchmark for the task, comprising 1,333 puzzles that span 18 thematic categories, diverse artistic styles, and multiple difficulty levels. To solve these puzzles, the authors propose RebusDescProgICE, a framework that unifies unstructured natural-language descriptions with code-based structured reasoning, augmented by a reasoning-guided in-context example selection mechanism. The framework combines multimodal modeling, chain-of-thought prompting, and program-aided reasoning in a model-agnostic way, making it compatible with both proprietary and open-source VLMs. Experiments show consistent gains over chain-of-thought baselines: +2.1–4.1% accuracy with closed-source VLMs and +20–30% with open-source VLMs. The approach offers a new paradigm for interpreting complex image-text puns and performing multi-step commonsense reasoning.
📝 Abstract
Understanding Rebus Puzzles, which use pictures, symbols, and letters to represent words or phrases creatively, requires a variety of skills such as image recognition, commonsense reasoning, multi-step reasoning, and image-based wordplay, making this a challenging task even for current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse benchmark of $1{,}333$ English Rebus Puzzles covering different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, and entertainment. We also propose RebusDescProgICE, a model-agnostic framework that combines an unstructured description with code-based, structured reasoning, along with improved, reasoning-based in-context example selection, improving the performance of Vision-Language Models on $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$ by $2.1$–$4.1\%$ with closed-source models and by $20$–$30\%$ with open-source models compared to Chain-of-Thought reasoning.
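The two components named above, a hybrid description-plus-program prompt and reasoning-guided in-context example (ICE) selection, can be sketched roughly as follows. This is a minimal illustration only: the function names, the lexical-overlap similarity, and the prompt layout are assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Example:
    puzzle: str     # short identifier or description of a solved puzzle
    reasoning: str  # stored reasoning trace for that puzzle
    answer: str     # gold answer phrase

def similarity(a: str, b: str) -> float:
    # Toy word-overlap (Jaccard) score; a real selector would compare
    # model-generated reasoning traces with a learned or embedding-based metric.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def select_examples(query_reasoning: str, pool: list[Example], k: int = 2) -> list[Example]:
    # Reasoning-guided ICE: rank candidate examples by how closely their
    # stored reasoning matches a draft reasoning for the query puzzle.
    return sorted(pool, key=lambda e: similarity(query_reasoning, e.reasoning),
                  reverse=True)[:k]

def build_prompt(description: str, examples: list[Example]) -> str:
    # Hybrid prompt: few-shot examples, an unstructured image description,
    # and a code-like scaffold that structures the multi-step reasoning.
    shots = "\n".join(f"Puzzle: {e.puzzle}\nReasoning: {e.reasoning}\nAnswer: {e.answer}"
                      for e in examples)
    program = ("def solve():\n"
               "    parts = decompose(description)  # symbols, letters, layout\n"
               "    return combine(parts)           # merge parts into a phrase\n")
    return f"{shots}\n\nImage description: {description}\n{program}\nAnswer:"
```

The selected examples and assembled prompt would then be sent to the chosen VLM; because the scaffold lives entirely in the prompt, the same pipeline works for closed-source and open-source models alike.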