🤖 AI Summary
This work investigates how post-training can enable Transformer models to learn the Cube-and-Conquer splitting heuristic for Boolean satisfiability (SAT), a domain traditionally dominated by symbolic methods. The authors propose a neurosymbolic framework that integrates reasoning traces generated by Monte Carlo Tree Search (MCTS) with solver-statistics-guided data construction, followed by a two-stage training pipeline comprising supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Evaluated on 100 SAT Competition benchmarks, a 4B-parameter model achieves a pass@5 score of 53, outperforming Claude-Sonnet-4 (50) and matching the best symbolic heuristics. Ablation studies reveal that SFT substantially enhances decision diversity (+5 points), with DPO providing an additional 2-point gain. This study provides the first demonstration that Transformers can effectively acquire SAT cube-splitting strategies through post-training.
📝 Abstract
Despite the effectiveness of Cube-and-Conquer (C&C) for solving challenging Boolean Satisfiability (SAT) problems, no prior work has shown that transformer-based models can learn effective cubing heuristics. We introduce a neuro-symbolic post-training framework for this task. We design an MCTS-based data curation pipeline that uses symbolic heuristics to explore splitting decisions over SAT competition formulas, producing preference data grounded in solver statistics and augmented with reasoning traces from a teacher model. Our two-stage post-training, supervised fine-tuning (SFT) followed by direct preference optimization (DPO), enables a 4B-parameter model to achieve a pass@5 score of 53 on 100 SAT competition benchmarks, surpassing frontier LLMs such as Claude-Sonnet-4 (50) and matching the best symbolic heuristic (53). Ablations show that SFT alone improves pass@5 from 46 to 51, with DPO adding 2 additional benchmarks; an entropy/agreement ablation on realized first-cube decisions further shows that SFT, not DPO, accounts for the root-level decision diversity that produces complementary per-run coverage over deterministic symbolic methods. This demonstrates that transformers can be trained to make effective cubing decisions in a domain traditionally dominated by symbolic methods.