🤖 AI Summary
Existing large language models (LLMs) underperform on logic puzzles that humans solve easily, due to a lack of task-specific training data and standardized evaluation benchmarks.
Method: We introduce Enigmata, the first scalable synthetic framework for logic puzzle reasoning, covering 7 categories and 36 tasks. It pioneers a "generate–verify" co-synthesis paradigm, pairing a controllable-difficulty puzzle generator with a rule-based automatic verifier, enabling multi-task Reinforcement Learning with Verifiable Rewards (RLVR). We further construct Enigmata-Eval, a rigorous logic puzzle benchmark, and propose an optimized multi-task RLVR training strategy.
Contribution/Results: The fine-tuned Qwen2.5-32B-Enigmata achieves substantial gains on Enigmata-Eval and ARC-AGI (32.8%), consistently surpassing o1 and o3-mini-high, while also generalizing to out-of-domain puzzle benchmarks and advanced reasoning tasks (AIME, GPQA). It exhibits improved cross-domain generalization and enhanced mathematical and STEM reasoning capabilities.
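The generate–verify design described above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's actual code: the arithmetic task, function names, and interfaces are assumptions, but they show the core pattern of a difficulty-controllable generator paired with a rule-based verifier whose binary output serves as the RLVR training reward.

```python
import random

def generate_sum_puzzle(difficulty: int, rng: random.Random):
    """Toy generator: higher difficulty yields larger operands."""
    hi = 10 ** difficulty
    a, b = rng.randint(1, hi), rng.randint(1, hi)
    return f"What is {a} + {b}?", {"answer": a + b}

def verify_sum_puzzle(model_output: str, meta: dict) -> bool:
    """Rule-based verifier: checks the final token against the ground truth."""
    try:
        return int(model_output.strip().split()[-1]) == meta["answer"]
    except ValueError:
        return False

def rlvr_reward(model_output: str, meta: dict) -> float:
    """Binary verifiable reward used as the RL training signal."""
    return 1.0 if verify_sum_puzzle(model_output, meta) else 0.0

rng = random.Random(0)
prompt, meta = generate_sum_puzzle(difficulty=2, rng=rng)
print(prompt, rlvr_reward(f"The answer is {meta['answer']}", meta))
```

Because the verifier is purely rule-based, every generated puzzle comes with an automatic, unlimited-scale reward signal, which is what makes multi-task RL training on synthetic puzzles practical.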
📝 Abstract
Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles that humans can solve without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored to equipping LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on puzzle reasoning benchmarks such as Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters, 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME, and GPQA (Diamond), demonstrating the generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at https://seed-enigmata.github.io.