Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-quality, verifiable data for complex reasoning tasks, such as mathematics and programming, is scarce, and human supervision remains prohibitively expensive. Method: This paper introduces the Loong Project, an open-source, multi-domain framework for synthetic reasoning data covering 12 challenging domains. It comprises LoongBench, a seed benchmark of 8,729 human-vetted samples, and LoongEnv, a scalable synthetic data generation environment. LoongEnv combines LLM-driven chain-of-thought generation, automated verification via code execution, modular prompting strategies, and an agent-environment interaction architecture into a closed-loop "generate-verify-train" pipeline. Contribution/Results: Combining human vetting with automated evaluation, the authors benchmark a broad suite of open-source and proprietary LLMs on LoongBench to map domain coverage and performance bottlenecks, and analyze the correctness, difficulty, and diversity of LoongEnv-generated data. The framework establishes a reproducible, scalable paradigm for reinforcement learning with verifiable rewards that reduces reliance on extensive human annotation.

📝 Abstract
Recent advances in Large Language Models (LLMs) have shown that their reasoning capabilities can be significantly improved through Reinforcement Learning with Verifiable Reward (RLVR), particularly in domains like mathematics and programming, where ground-truth correctness can be automatically evaluated. However, extending this success to other reasoning-intensive domains remains challenging due to the scarcity of high-quality, verifiable datasets and the high cost of human supervision. In this work, we introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification across a diverse range of reasoning-intensive domains. The framework consists of two key components: (1) LoongBench, a curated seed dataset containing 8,729 human-vetted examples across 12 domains (e.g., Advanced Mathematics, Chemistry, Logic), each paired with executable code and rich metadata; and (2) LoongEnv, a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples. Together, these components form an agent-environment loop that enables reinforcement learning, where an LLM-based agent is rewarded for generating Chain-of-Thought (CoT) solutions that align with code-executed answers. Empirically, we benchmark LoongBench on a broad suite of both open-source and proprietary LLMs to evaluate domain coverage and reveal performance bottlenecks. In addition, we conduct a comprehensive analysis of synthetic data generated by LoongEnv, examining correctness, difficulty, and diversity. Code and documentation are available at https://github.com/camel-ai/loong.
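The agent-environment loop in the abstract rewards an LLM when the final answer in its Chain-of-Thought matches the answer produced by executing the sample's reference code. A minimal sketch of that verifiable-reward check is below; all function names are illustrative assumptions, not the actual Loong API, and answer extraction in practice would be far more robust than taking the last line.

```python
import contextlib
import io


def execute_reference_code(code: str) -> str:
    """Run a sample's executable code and capture its printed answer.

    In a real pipeline this would run in a sandbox with timeouts;
    here we simply capture stdout for illustration.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()


def extract_final_answer(cot: str) -> str:
    """Take the last non-empty line of a Chain-of-Thought as its answer."""
    lines = [ln.strip() for ln in cot.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def verifiable_reward(cot: str, reference_code: str) -> float:
    """Binary reward: 1.0 iff the CoT answer matches the executed answer."""
    return 1.0 if extract_final_answer(cot) == execute_reference_code(reference_code) else 0.0


# Toy example: a question-answer-code triple for simple arithmetic.
code = "print(6 * 7)"
cot = "6 times 7 is 42.\n42"
print(verifiable_reward(cot, code))  # 1.0
```

Because the reward comes from code execution rather than human grading, the same check scales across any domain where a sample ships with executable reference code.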
Problem

Research questions and friction points this paper is trying to address.

Extending verifiable reasoning to diverse domains beyond mathematics
Addressing scarcity of high-quality verifiable reasoning datasets
Reducing high cost of human supervision in reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source framework for scalable synthetic data generation
Modular environment supporting multiple prompting strategies
Agent-environment loop enabling reinforcement learning with verifiable rewards