UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the limitations of current large language models on complex general-purpose reasoning tasks, which stem from insufficient capabilities in multi-step logic, planning, and verification, as well as a lack of large-scale, high-quality, difficulty-stratified training data. The authors propose a code-based problem-solving framework that decouples the logical reasoning core of a task from its natural language expression, enabling the automatic generation of synthetic data spanning hundreds of task types across ten difficulty levels. To mitigate reward sparsity and the non-negative reward trap, they introduce a Bipolar Float Reward (BFR) mechanism. Experimental results demonstrate that combining task diversity with difficulty-matched training strategies significantly enhances both reasoning performance and training efficiency, effectively guiding models toward globally optimal logical solutions.
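To make the decoupling concrete: the logical core of each task is an executable program, so ground-truth answers can be generated and verified automatically, while the natural-language wording is rendered separately. Below is a minimal, hypothetical Python sketch of that pattern; the sorting task, the function names, and the difficulty scaling are illustrative assumptions, not the paper's actual pipeline.

```python
import random

def generate_sorting_task(difficulty: int, seed: int = 0) -> dict:
    """Generate one synthetic reasoning task; higher difficulty means a larger instance."""
    rng = random.Random(seed)
    n = 3 + 2 * difficulty                      # difficulty scales instance size
    values = [rng.randint(0, 99) for _ in range(n)]

    # Logical core: the answer is computed by code, so it is exactly verifiable.
    answer = sorted(values)

    # Natural-language surface form, rendered separately from the logical core.
    prompt = (f"Sort the following {n} numbers in ascending order: "
              + ", ".join(map(str, values)))
    return {"prompt": prompt, "answer": answer, "difficulty": difficulty}

def verify(task: dict, model_output: list) -> bool:
    """Binary verifier: exact match against the programmatic ground truth."""
    return model_output == task["answer"]

if __name__ == "__main__":
    task = generate_sorting_task(difficulty=4, seed=42)
    print(task["prompt"])
    print("verified:", verify(task, task["answer"]))   # True by construction
```

Because the answer comes from executed code rather than human annotation, the same generator can emit arbitrarily many difficulty-stratified instances with exact verifiers, which is precisely the property RLVR-style training needs.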

📝 Abstract
While Large Language Models (LLMs) have demonstrated significant potential in natural language processing, complex general-purpose reasoning requiring multi-step logic, planning, and verification remains a critical bottleneck. Although Reinforcement Learning with Verifiable Rewards (RLVR) has succeeded in specific domains, the field lacks large-scale, high-quality, and difficulty-calibrated data for general reasoning. To address this, we propose UltraLogic, a framework that decouples the logical core of a problem from its natural language expression through a Code-based Solving methodology to automate high-quality data production. The framework comprises hundreds of unique task types and an automated calibration pipeline across ten difficulty levels. Furthermore, to mitigate binary reward sparsity and the Non-negative Reward Trap, we introduce the Bipolar Float Reward (BFR) mechanism, utilizing graded penalties to effectively distinguish perfect responses from those with logical flaws. Our experiments demonstrate that task diversity is the primary driver for reasoning enhancement, and that BFR, combined with a difficulty matching strategy, significantly improves training efficiency, guiding models toward global logical optima.
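The abstract does not give the BFR formula, so the following is a minimal Python sketch of the stated idea under assumptions: a response is scored against a set of verifiable sub-checks, only a fully correct response earns a positive reward, and imperfect responses receive graded negative rewards instead of the flat zero that produces the Non-negative Reward Trap. The function names, the [-1, 1] range, and the linear penalty schedule are illustrative, not the authors' implementation.

```python
def binary_reward(correct: bool) -> float:
    """Baseline verifiable reward: sparse, and never negative."""
    return 1.0 if correct else 0.0

def bipolar_float_reward(num_subgoals: int, num_passed: int,
                         format_ok: bool = True) -> float:
    """Illustrative bipolar reward on [-1, 1].

    A fully correct response earns +1; partially correct responses
    receive graded negative rewards, so near-misses with logical
    flaws are penalized rather than mildly rewarded, and only a
    perfect solution is reinforced.
    """
    if not format_ok:
        return -1.0                       # malformed output: maximum penalty
    if num_passed == num_subgoals:
        return 1.0                        # perfect response
    # Graded penalty: more passed checks means a milder negative reward.
    frac = num_passed / max(num_subgoals, 1)
    return -1.0 + frac                    # stays in [-1, 0) when imperfect

print(bipolar_float_reward(num_subgoals=5, num_passed=5))  # 1.0, perfect
print(bipolar_float_reward(num_subgoals=5, num_passed=4))  # about -0.2, flawed
print(binary_reward(False))                                # 0.0, the non-negative trap
```

Under this scheme the reward signal stays dense (every response gets informative feedback) while the sign boundary separates perfect solutions from logically flawed ones, which is the distinction the binary baseline cannot make.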
Problem

Research questions and friction points this paper is trying to address.

complex reasoning
large-scale data
difficulty calibration
reward sparsity
logical flaws
Innovation

Methods, ideas, or system contributions that make the work stand out.

UltraLogic
Code-based Solving
Bipolar Float Reward
difficulty-calibrated data
reasoning enhancement
👥 Authors
Yile Liu
Hunyuan, Tencent
Yixian Liu
Hunyuan, Tencent
Zongwei Li
Hunyuan, Tencent
Yufei Huang
Hunyuan, Tencent
Xinhua Feng
Hunyuan, Tencent
Zhichao Hu
Hunyuan, Tencent
Jinglu Hu
Waseda University
Jianfeng Yan
Hunyuan, Tencent
Fengzong Lian
Hunyuan, Tencent
Yuhong Liu
Santa Clara University
Trustworthy AI · Security and Privacy · IoT · Blockchain · Social network