🤖 AI Summary
Existing large language models (LLMs) underperform on logic puzzles that humans solve easily, due to a lack of task-specific training data and standardized evaluation benchmarks.
Method: We introduce Enigmata, the first scalable synthetic framework for logic puzzle reasoning, covering 7 categories and 36 tasks. It pioneers a "generate–verify" co-synthesis paradigm, pairing a controllable-difficulty puzzle generator with a rule-based automatic verifier, enabling multi-task Reinforcement Learning with Verifiable Rewards (RLVR). We further construct Enigmata-Eval, a rigorous logic puzzle benchmark, and propose an optimized multi-task RLVR training strategy.
Contribution/Results: The fine-tuned Qwen2.5-32B-Enigmata achieves substantial gains on Enigmata-Eval and ARC-AGI (32.8%), consistently surpassing o1 and o3-mini-high, while also generalizing to out-of-domain puzzle benchmarks and advanced reasoning tasks (AIME, GPQA). It exhibits improved cross-domain generalization and enhanced mathematical and STEM reasoning capabilities.
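The generate–verify design described above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's actual code: the arithmetic task, function names, and interfaces are assumptions, but they show the core pattern of a difficulty-controllable generator paired with a rule-based verifier whose binary output serves as the RLVR training reward.

```python
import random

def generate_sum_puzzle(difficulty: int, rng: random.Random):
    """Toy generator: higher difficulty yields larger operands."""
    hi = 10 ** difficulty
    a, b = rng.randint(1, hi), rng.randint(1, hi)
    return f"What is {a} + {b}?", {"answer": a + b}

def verify_sum_puzzle(model_output: str, meta: dict) -> bool:
    """Rule-based verifier: checks the final token against the ground truth."""
    try:
        return int(model_output.strip().split()[-1]) == meta["answer"]
    except ValueError:
        return False

def rlvr_reward(model_output: str, meta: dict) -> float:
    """Binary verifiable reward used as the RL training signal."""
    return 1.0 if verify_sum_puzzle(model_output, meta) else 0.0

rng = random.Random(0)
prompt, meta = generate_sum_puzzle(difficulty=2, rng=rng)
print(prompt, rlvr_reward(f"The answer is {meta['answer']}", meta))
```

Because the verifier is purely rule-based, every generated puzzle comes with an automatic, unlimited-scale reward signal, which is what makes multi-task RL training on synthetic puzzles practical.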
📝 Abstract
Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles that humans can solve without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored to equipping LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on puzzle reasoning benchmarks such as Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters, 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME, and GPQA (Diamond), demonstrating the generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at https://seed-enigmata.github.io.