Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use

📅 2026-05-25
🤖 AI Summary
This study examines the "peak-then-collapse" degradation that appears when reinforcement learning is applied to knowledge-graph tool use, and attributes it to the lack of natural-language feedback in the tool interface. Working with a minimal Freebase interface of only four navigation verbs, the authors apply GRPO-based reinforcement learning to Qwen2.5-7B-Instruct across seven reward designs and run an oracle ablation, which shows that the bottleneck is not erroneous relation selection: injecting gold relations improves exact match by only 0.20 percentage points. As a mitigation, one-iteration self-distillation reaches 40.0% exact match at 7B, while doubling capacity to 14B adds only 0.25 percentage points, suggesting that the ceiling is set by the interface design itself within the 7B to 14B range tested.
📝 Abstract
We test the standard RLVR tool-use recipe -- GRPO on Qwen2.5-7B-Instruct -- on a deliberately minimal knowledge-graph tool API: four Freebase navigation verbs over Complex WebQuestions. Under a self-verifiable retrieval reward, the policy's tool-grounded answer rate climbs from $3.8\%$ to $9.6\%$ over 250 steps, then collapses to $0\%$ within a single 50-step window -- a \emph{peak-then-collapse} pattern replicated across four seeds. Across seven reward designs, we find four recurring failure modes: adding denser or more targeted proxy rewards shifts the failure mode rather than eliminating it. We argue that a key difference from Python interpreters, web search, and JSON APIs is interface feedback: their failures often leak natural-language signal the model saw in pretraining. A Python traceback names the failing line; an empty Freebase result \texttt{[]} does not. Stripping away that surface exposes a degradation regime that same-family reward redesigns do not fix. A direct oracle ablation rules out relation selection: injecting gold relations at every retrieval call lifts exact-match accuracy by only $+0.20$~pp, and $95.4\%$ of retrieval-dependent errors are retrieval-composition failures rather than answer-extraction failures. As a mitigation, one-iteration self-distillation reaches $40.0\%$ EM at 7B and is capacity-invariant: doubling capacity to 14B improves EM by only $0.25$~pp, and initialization barely matters -- the ceiling appears interface-bound within the 7B--14B range tested.
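The interface-feedback contrast in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `TOY_GRAPH`, `get_neighbors`, `run_python`, and the entity identifiers are hypothetical stand-ins, not the paper's four Freebase verbs or its tool API. The sketch only shows that a failed Python call returns text naming the error and the failing line, while a failed graph lookup returns a bare empty list.

```python
import traceback

# Hypothetical toy graph: (entity, relation) -> neighbor entities.
# The identifiers are illustrative, not real Freebase IDs.
TOY_GRAPH = {
    ("m.person_a", "people.person.place_of_birth"): ["m.city_b"],
}


def get_neighbors(entity, relation):
    """Hypothetical KG verb: any unknown entity or relation yields []."""
    return TOY_GRAPH.get((entity, relation), [])


def run_python(snippet):
    """Execute a snippet; return the traceback text if it fails, else ''."""
    try:
        exec(snippet, {})
        return ""
    except Exception:
        return traceback.format_exc()


# A wrong relation name produces an empty result with no explanation.
kg_feedback = get_neighbors("m.person_a", "people.person.birthplace")

# A wrong dictionary key produces a traceback naming the error and the line.
py_feedback = run_python("x = {}\nx['birthplace']")

print(repr(kg_feedback))
print(py_feedback)
```

In this toy setting the policy receives the same `[]` whether the entity, the relation, or the composition of hops was wrong, whereas the traceback text identifies both the exception type and the offending key.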
Problem

Research questions and friction points this paper is trying to address.

peak-then-collapse
knowledge-graph tool use
interface feedback
reinforcement learning
tool-grounded reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

peak-then-collapse
knowledge-graph tool use
interface feedback
self-distillation
retrieval-composition failure