PuzzleClone: An SMT-Powered Framework for Synthesizing Verifiable Data

📅 2025-08-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing mathematical and logical reasoning datasets suffer from low reliability, insufficient diversity, and poor scalability, hindering the advancement of large language models' (LLMs) formal reasoning capabilities. To address these limitations, we propose an SMT (Satisfiability Modulo Theories)-based formal data generation framework: seed puzzles are first encoded as structured logical specifications; then, systematic randomization of variables and constraints generates scalable puzzle variants; finally, procedural reproduction and automated verification ensure correctness. The resulting dataset comprises over 83,000 high-quality, automatically verified puzzles spanning diverse logical and mathematical reasoning tasks. On the PuzzleClone test set, post-training with our data boosts model accuracy from 14.4% to 56.2%. Moreover, improvements of up to 12.5 percentage points are observed across seven mainstream mathematical and logical reasoning benchmarks. These results demonstrate the effectiveness and scalability of our approach in enhancing LLMs' formal reasoning performance.

πŸ“ Abstract
High-quality mathematical and logical datasets with verifiable answers are essential for strengthening the reasoning capabilities of large language models (LLMs). While recent data augmentation techniques have facilitated the creation of large-scale benchmarks, existing LLM-generated datasets often suffer from limited reliability, diversity, and scalability. To address these challenges, we introduce PuzzleClone, a formal framework for synthesizing verifiable data at scale using Satisfiability Modulo Theories (SMT). Our approach features three key innovations: (1) encoding seed puzzles into structured logical specifications, (2) generating scalable variants through systematic variable and constraint randomization, and (3) ensuring validity via a reproduction mechanism. Applying PuzzleClone, we construct a curated benchmark comprising over 83K diverse and programmatically validated puzzles. The generated puzzles span a wide spectrum of difficulty and formats, posing significant challenges to current state-of-the-art models. We conduct post-training (SFT and RL) on PuzzleClone datasets. Experimental results show that training on PuzzleClone yields substantial improvements not only on the PuzzleClone test set but also on logic and mathematical benchmarks. Post-training raises the PuzzleClone average from 14.4% to 56.2% and delivers consistent improvements across seven logic and mathematical benchmarks of up to 12.5 absolute percentage points (AMC2023 from 52.5 to 65.0). Our code and data are available at https://github.com/puzzleclone.
Problem

Research questions and friction points this paper is trying to address.

Creating verifiable datasets for LLM reasoning enhancement
Addressing limited reliability and diversity in LLM-generated data
Synthesizing scalable puzzle variants with SMT-based framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

SMT-based framework for verifiable data synthesis
Encoding puzzles into structured logical specifications
Systematic randomization and reproduction for validity
Kai Xiong
HiThink Research
Yanwei Huang
Zhejiang University
Rongjunchen Zhang
HiThink Research
Kun Chen
HiThink Research
Haipang Wu
HiThink Research