🤖 AI Summary
Existing mathematical reasoning benchmarks are static and repeatedly exposed, which leaves them susceptible to memorization, while manually crafting new problems with reliable answers is costly. This work proposes ReverseMath, an “answer inversion” method that automatically generates verifiable new problems: it masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problems serve both evaluation and training. In evaluation, paired original/reversed problems reveal substantial behavioral shifts, including cases where a model incorrectly outputs the original answer on the reversed problem, which suggests memorization-like behavior. In training, the automatically labeled reversed problems are used as data augmentation for reinforcement learning, and including them improves performance across multiple mathematical reasoning benchmarks.
📝 Abstract
Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers remains costly. We introduce ReverseMath, a scalable method for generating new math problems through answer inversion. Given a problem and its answer, ReverseMath masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problem reverses the original input-output relation, making its answer known by construction. We study ReverseMath for both evaluation and training. For evaluation, paired original/reversed problems reveal substantial behavioral shifts: models sometimes fail on reversed problems and even incorrectly output the original answer, suggesting memorization-like behavior. For training, ReverseMath provides automatically labeled reversed problems as data augmentation for reinforcement learning (RL). Experiments show that including ReverseMath-generated data improves mathematical reasoning performance across multiple benchmarks, demonstrating its value as both an analysis tool and a scalable source of verifiable training data.
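The inversion step and the paired evaluation check can be illustrated with a toy sketch. This is not the authors' implementation: the abstract describes rewriting arbitrary problems, whereas this sketch assumes a single hand-written word-problem template so that the example stays self-contained. The names `Problem`, `make_original`, `make_reversed`, and `classify_prediction` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    text: str
    answer: int


def make_original(price: int, count: int) -> Problem:
    """Forward problem: price and count are given, the total is the answer."""
    text = (
        f"Pencils cost {price} dollars each. Alice buys {count} pencils. "
        "How many dollars does she pay in total?"
    )
    return Problem(text=text, answer=price * count)


def make_reversed(price: int, count: int) -> Problem:
    """Reversed problem (toy stand-in for the rewriting step).

    The count is masked, the original answer (the total) is stated as a
    known condition, and the masked count becomes the new answer, so the
    label is known by construction.
    """
    total = price * count  # original answer, now given in the problem text
    text = (
        f"Pencils cost {price} dollars each. Alice pays {total} dollars in total. "
        "How many pencils does she buy?"
    )
    return Problem(text=text, answer=count)


def classify_prediction(predicted: int, original: Problem, reversed_: Problem) -> str:
    """Label a model's answer to the reversed problem.

    Assumed labels (hypothetical): 'correct', 'original_answer_leak' when the
    model repeats the original problem's answer, and 'other_error' otherwise.
    If both answers happen to coincide, the prediction counts as correct.
    """
    if predicted == reversed_.answer:
        return "correct"
    if predicted == original.answer:
        return "original_answer_leak"
    return "other_error"


if __name__ == "__main__":
    original = make_original(price=3, count=7)
    reversed_problem = make_reversed(price=3, count=7)
    print(original.text, "->", original.answer)
    print(reversed_problem.text, "->", reversed_problem.answer)
    print(classify_prediction(21, original, reversed_problem))
```

Separating `original_answer_leak` from other errors mirrors the behavior the abstract highlights: a model that outputs the original answer on the reversed problem is a stronger hint of memorization than an arbitrary wrong answer.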