🤖 AI Summary
Existing benchmarks offer little control over the difficulty of joint logical-numerical reasoning tasks, which makes it hard to pinpoint where models fail. To address this, we propose LogiNumSynth, presented as the first natural language reasoning problem generation framework whose difficulty dimensions are all independently controllable. It separately modulates logical depth (rule-chain length), world-modeling complexity, and numerical computation difficulty, and generates stepwise reasoning traces alongside final answers. Its modular, rule-guided architecture permits fine-grained intervention, supporting both diagnostic assessment and targeted data augmentation. Experiments reveal that state-of-the-art large language models exhibit substantial performance gaps on these controlled tasks, underscoring both the diagnostic precision of LogiNumSynth and its utility as a source of high-quality, semantically grounded training data.
📝 Abstract
Joint logical-numerical reasoning remains a major challenge for language models, yet existing datasets rely on fixed rule sets and offer limited control over task complexity, which constrains their generalizability for evaluation and training. We present LogiNumSynth, a flexible synthesizer that generates natural language tasks requiring both logical reasoning (e.g., rule-based reasoning) and numerical reasoning (e.g., arithmetic computation). LogiNumSynth supports fine-grained control over the richness of the reasoning world, the depth of logical reasoning, and the complexity of numerical computations, enabling data synthesis across difficulty levels. We make three key contributions: (1) Synthesizer: synthesizing fully controllable joint reasoning tasks over natural language; (2) Evaluation & Process Analysis: evaluating both process accuracy and answer accuracy; (3) Targeted Training: using synthesized data to enhance LLMs' reasoning performance. Experiments with multiple LLMs highlight persistent weaknesses in logical-numerical reasoning, showing that LogiNumSynth can serve as both a diagnostic tool and a source of targeted supervision for advancing integrated reasoning skills.
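
To make the three control dimensions concrete, here is a minimal Python sketch of what a controllable synthesizer of this kind might look like. It is an illustration under stated assumptions, not the LogiNumSynth implementation: the names `synthesize`, `Problem`, `num_entities`, `depth`, `max_operand`, and `process_accuracy` are all hypothetical, the "world" is reduced to a set of named entities with integer scores, and process accuracy is approximated as the fraction of gold reasoning steps that appear in a predicted trace.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    text: str      # natural language facts, rules, and question
    trace: list    # gold stepwise reasoning, one string per derived value
    answer: int    # gold final answer


def synthesize(num_entities=4, depth=3, max_operand=9, seed=0):
    """Build one toy problem (hypothetical knobs, not the paper's API).

    num_entities -> world richness (entities outside the chain are distractors)
    depth        -> logical depth (number of rules chained to reach the answer)
    max_operand  -> numerical difficulty (size of the constants involved)
    """
    if depth + 1 > num_entities:
        raise ValueError("need at least depth + 1 entities to build the chain")
    rng = random.Random(seed)
    names = [f"E{i}" for i in range(num_entities)]
    chain = rng.sample(names, depth + 1)

    # Base fact for the first entity in the chain.
    values = {chain[0]: rng.randint(1, max_operand)}
    facts = [f"The score of {chain[0]} is {values[chain[0]]}."]
    trace = [f"{chain[0]} = {values[chain[0]]}"]

    # Distractor facts for entities that are not on the reasoning chain.
    for name in names:
        if name not in chain:
            facts.append(f"The score of {name} is {rng.randint(1, max_operand)}.")

    # Each rule derives the next entity's score from the previous one.
    rules = []
    for prev, cur in zip(chain, chain[1:]):
        op = rng.choice(["+", "*"])
        k = rng.randint(1, max_operand)
        new = values[prev] + k if op == "+" else values[prev] * k
        word = "plus" if op == "+" else "times"
        rules.append(f"The score of {cur} is the score of {prev} {word} {k}.")
        trace.append(f"{cur} = {values[prev]} {op} {k} = {new}")
        values[cur] = new

    question = f"What is the score of {chain[-1]}?"
    text = " ".join(facts + rules + [question])
    return Problem(text=text, trace=trace, answer=values[chain[-1]])


def answer_accuracy(predicted_answer, problem):
    """1.0 if the final answer matches the gold answer, else 0.0."""
    return 1.0 if predicted_answer == problem.answer else 0.0


def process_accuracy(predicted_trace, problem):
    """Fraction of gold steps found in the predicted trace (an assumed metric)."""
    normalize = lambda s: " ".join(s.split())
    predicted = {normalize(step) for step in predicted_trace}
    hits = sum(1 for step in problem.trace if normalize(step) in predicted)
    return hits / len(problem.trace)
```

Calling `synthesize(num_entities=6, depth=4, max_operand=20, seed=7)` would yield a deeper chain in a richer world with larger constants. Because each knob is an independent argument, one can be varied while the others are held fixed, which is the property the abstract relies on for diagnostic evaluation.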