KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA

📅 2025-10-28
🤖 AI Summary
Existing agentic KBQA methods typically fine-tune LLMs on reasoning trajectories synthesized under process supervision, which gives weak incentives for exploration and does little to strengthen agentic reasoning. This paper proposes KnowCoder-A1, which the authors present as the first outcome-only supervised, multi-stage curriculum reinforcement learning framework for KBQA. It first uses outcome-based rejection sampling to collect a small set of high-quality reasoning trajectories without step-by-step human annotation, then applies curriculum RL fine-tuning with reward schedules that progress from easy to hard, which mitigates reward sparsity and encourages autonomous exploration. On the GrailQA zero-shot subset, it achieves up to an 11.1% relative improvement while using only one-twelfth of the training data. Gains are also reported on WebQSP and ComplexWebQuestions, supporting the effectiveness and generalizability of outcome-supervised agentic reasoning.

📝 Abstract
Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate the corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via multi-stage curriculum reinforcement learning that proceeds from easy to hard. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.
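The outcome-based rejection sampling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the acceptance criterion, so this sketch assumes a rollout is kept only when its final answer set exactly matches the gold answers; the `Trajectory` structure and the `rejection_sample` function are hypothetical names, not the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """One sampled reasoning rollout (hypothetical structure)."""
    question: str
    steps: tuple[str, ...]    # intermediate queries; never inspected by the filter
    answers: frozenset[str]   # final answer set returned by the rollout


def rejection_sample(candidates, gold_answers):
    """Keep rollouts whose final answers equal the gold answers.

    Assumption: exact set match is the acceptance rule. Only the outcome
    is checked, so no step-level annotation is needed.
    """
    gold = frozenset(gold_answers)
    return [t for t in candidates if t.answers == gold]
```

Because the filter looks only at `answers`, any sequence of intermediate steps is acceptable as long as it ends at the right result, which is the sense in which the supervision is outcome-only.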
Problem

Research questions and friction points this paper is trying to address.

Improving KBQA through agentic reasoning with outcome-only supervision
Addressing weak exploration incentives in process-supervised reasoning methods
Overcoming reward sparsity in outcome supervision via curriculum reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses outcome-only supervision for training
Applies multi-stage curriculum reinforcement learning
Fine-tunes LLM with easy-to-hard reward schedules
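As a rough illustration of an easy-to-hard reward schedule, the sketch below uses a hypothetical two-stage design: the early stage gives partial credit through answer-set F1, which makes rewards denser, and the later stage pays out only on an exact match. The summary does not specify the actual reward functions or the number of stages, so `answer_f1`, `staged_reward`, and the stage numbering are assumptions made for illustration.

```python
def answer_f1(predicted, gold):
    """Set-level F1 between predicted and gold answers."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def staged_reward(predicted, gold, stage):
    """Outcome-only reward that tightens as the curriculum advances.

    Assumption: stage 0 is lenient (F1 partial credit) and any later
    stage is strict (1.0 only when the answer sets are identical).
    """
    if stage == 0:
        return answer_f1(predicted, gold)
    return 1.0 if set(predicted) == set(gold) else 0.0
```

Under this assumed schedule, a prediction that contains the right answer plus one spurious entity still earns a positive reward early in training but earns nothing once the strict stage begins.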
Zhuo Chen
Institute of Computing Technology, Chinese Academy of Sciences
Fei Wang
Institute of Computing Technology, Chinese Academy of Sciences
Zixuan Li
Assistant Professor at ICT, UCAS
Knowledge Graph, Large Language Model
Zhao Zhang
Institute of Computing Technology, Chinese Academy of Sciences
Weiwei Ding
Institute of Computing Technology, Chinese Academy of Sciences
Chuanguang Yang
Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Knowledge Distillation, Representation Learning
Yongjun Xu
Institute of Computing Technology, Chinese Academy of Sciences
Xiaolong Jin
Purdue University
AI safety
Jiafeng Guo
Professor, Institute of Computing Technology, CAS
Information Retrieval, Machine Learning, Text Analysis, NeuIR