🤖 AI Summary
Existing KBQA methods rely on process-supervised fine-tuning of LLMs, which provides weak incentives for exploration and yields limited gains in agentic reasoning. This paper proposes the first outcome-only supervised, multi-stage curriculum reinforcement learning framework for KBQA. It first employs outcome-guided rejection sampling to generate high-quality reasoning trajectories without requiring step-by-step human annotations; it then applies a curriculum RL fine-tuning strategy that progresses from easy to hard to strengthen autonomous exploration and structured reasoning. Crucially, the method eliminates dependence on manually annotated reasoning processes. On the GrailQA zero-shot subset, it achieves an 11.1% relative improvement over the prior SOTA, and it surpasses all previous methods while using only 1/12 of their training data. Significant gains are also observed on WebQSP and ComplexWebQuestions, demonstrating the effectiveness and generalizability of outcome-supervised agentic reasoning.
📝 Abstract
Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate the corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning over KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 is trained under outcome-only supervision via multi-stage reinforcement learning (RL) with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.
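The two training ingredients described above can be sketched in a few lines. This is a minimal illustration only, assuming a generic rollout interface; all names (`rollout`, `rejection_sample`, `staged_reward`) and the specific three-stage schedule are hypothetical and not the paper's actual implementation.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]  # a sequence of reasoning steps / KB queries


def rejection_sample(question: str,
                     gold_answer: str,
                     rollout: Callable[[str], Tuple[Trajectory, str]],
                     n_samples: int = 8) -> List[Trajectory]:
    """Outcome-based rejection sampling: sample several trajectories and
    keep only those whose *final answer* matches the gold answer.
    No step-level (process) annotations are needed."""
    kept = []
    for _ in range(n_samples):
        trajectory, answer = rollout(question)
        if answer == gold_answer:  # outcome-only filter
            kept.append(trajectory)
    return kept


def staged_reward(answer: str, gold: str,
                  partial_overlap: float, stage: int) -> float:
    """An easy-to-hard reward schedule across curriculum stages.
    Early stages grant denser partial credit to ease reward sparsity;
    the final stage rewards only exact outcomes."""
    if stage == 0:
        return partial_overlap                          # easy: partial credit
    if stage == 1:
        return 1.0 if partial_overlap > 0.5 else 0.0    # medium: thresholded
    return 1.0 if answer == gold else 0.0               # hard: exact match only
```

The filtered trajectories would seed the initial fine-tuning stage, while `staged_reward` stands in for the stage-dependent reward an RL trainer would call during each curriculum phase.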