🤖 AI Summary
This work addresses the challenge of compositional multi-hop reasoning in large language models within specialized scientific domains. The authors propose a novel bottom-up learning paradigm that, for the first time, leverages knowledge graphs as implicit reward models to generate verifiable reward signals from 1–3 hop paths. Within a reinforcement learning framework, this approach guides the model to compose fundamental axioms for reasoning rather than merely optimizing final answers. By integrating supervised fine-tuning with reinforcement learning, the method enables scalable supervision for compositional reasoning and substantially improves zero-shot generalization. Evaluated in the medical domain, a 14B-parameter model trained with this approach significantly outperforms GPT-5.2 and Gemini 3 Pro on complex 4–5 hop reasoning tasks and demonstrates strong robustness against adversarial perturbations.
📝 Abstract
Large language models have achieved near-expert performance in structured reasoning domains like mathematics and programming, yet their ability to perform compositional multi-hop reasoning in specialized scientific fields remains limited. We propose a bottom-up learning paradigm in which models are grounded in axiomatic domain facts and compose them to solve complex, unseen tasks. To this end, we present a post-training pipeline, based on a combination of supervised fine-tuning and reinforcement learning (RL), in which knowledge graphs act as implicit reward models. By deriving novel reward signals from knowledge graph paths, we provide verifiable, scalable, and grounded supervision that encourages models to compose intermediate axioms rather than optimize only final answers during RL. We validate this approach in the medical domain, training a 14B model on short-hop reasoning paths (1-3 hops) and evaluating its zero-shot generalization to complex multi-hop queries (4-5 hops). Our experiments show that path-derived rewards act as a "compositional bridge", enabling our model to significantly outperform much larger models and frontier systems like GPT-5.2 and Gemini 3 Pro on the most difficult reasoning tasks. Furthermore, we demonstrate the robustness of our approach under adversarial option-shuffling stress tests. This work suggests that grounding the reasoning process in structured knowledge is a scalable and efficient path toward intelligent reasoning.
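To make the idea of a knowledge graph acting as an implicit reward model more concrete, here is a minimal, purely illustrative sketch of a path-derived reward. The abstract does not specify the exact reward formula, so every name, weight, and scoring rule below is an assumption: the reward scores how many intermediate hops of a model's predicted reasoning path are verifiable KG edges, plus a bonus for reaching the correct terminal answer.

```python
from typing import List, Set, Tuple

# A KG edge as a (head, relation, tail) triple.
Triple = Tuple[str, str, str]

def path_reward(predicted_path: List[Triple],
                kg_edges: Set[Triple],
                gold_answer: str,
                step_weight: float = 0.5,
                answer_weight: float = 0.5) -> float:
    """Hypothetical path-derived reward (assumed form, not the paper's):
    rewards grounded intermediate hops, not only the final answer."""
    if not predicted_path:
        return 0.0
    # Fraction of predicted hops that exist as edges in the KG.
    grounded = sum(1 for triple in predicted_path if triple in kg_edges)
    step_score = grounded / len(predicted_path)
    # Bonus if the tail of the final hop matches the gold answer.
    answer_score = 1.0 if predicted_path[-1][2] == gold_answer else 0.0
    return step_weight * step_score + answer_weight * answer_score

# Toy medical KG and a 3-hop predicted reasoning path (illustrative facts).
kg = {
    ("aspirin", "inhibits", "COX-1"),
    ("COX-1", "produces", "thromboxane A2"),
    ("thromboxane A2", "promotes", "platelet aggregation"),
}
path = [
    ("aspirin", "inhibits", "COX-1"),
    ("COX-1", "produces", "thromboxane A2"),
    ("thromboxane A2", "promotes", "platelet aggregation"),
]
print(path_reward(path, kg, "platelet aggregation"))  # 1.0
```

In an RL loop, a dense signal of this kind is what lets short-hop (1-3) supervision shape the composition of intermediate steps, rather than only the final answer.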