🤖 AI Summary
This work addresses the challenge of improving large language models' (LLMs) generalization from synthetic graph data to real-world tasks with implicit graph structure, e.g., multi-hop question answering and structured planning. Unlike conventional supervised fine-tuning, we propose the first reinforcement learning (RL)-based framework for implicit graph reasoning generalization. Our method introduces a dual reward mechanism, combining process-oriented and result-oriented rewards, to mitigate overfitting and enhance cross-domain adaptability, and it integrates alignment algorithms (e.g., GRPO, DPO) with solution-space modeling and process supervision, enabling plug-and-play deployment for both pretrained and synthetic-data-fine-tuned LLMs. Evaluated on five benchmarks, our approach achieves an average improvement of 12.9% (p < 0.01), with process rewards consistently outperforming result rewards. Hybrid training on synthetic and real data further boosts performance. Key contributions: (1) the first RL training paradigm tailored to implicit graph reasoning generalization, and (2) an interpretable, robust process supervision mechanism.
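Since the summary mentions GRPO among the alignment algorithms, the group-relative advantage computation at the core of GRPO-style training can be sketched as follows. This is a minimal illustration only: the plain-list interface and the `1e-8` stabilizer are assumptions for the sketch, not details taken from the paper.

```python
def grpo_advantages(group_rewards):
    """Normalize the rewards of a group of sampled responses to the same
    prompt, yielding group-relative advantages as used in GRPO-style RL.

    Each response's advantage is its reward minus the group mean, divided
    by the group standard deviation (with a small epsilon for stability).
    """
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in group_rewards]


# Example: two good and two bad responses in a group of four.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Above-average responses get positive advantages, below-average negative.
```

In practice these advantages weight the policy-gradient update for each sampled response, so no separate value network is needed.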
📝 Abstract
Previous research has sought to enhance the graph reasoning capabilities of LLMs by supervised fine-tuning on synthetic graph data. While this produced specialized LLMs better at solving graph algorithm problems, we don't need LLMs for shortest path: we need generalization from synthetic graph data to real-world tasks with implicit graph structures. In this work, we propose to unlock generalizable learning from synthetic graph data with reinforcement learning. We first design solution-based and process-based rewards for synthetic graph problems: instead of the rigid memorization of response patterns induced by direct fine-tuning, we posit that RL helps LLMs grasp the essentials underlying graph reasoning and alleviates overfitting. We employ RL algorithms such as GRPO and DPO, aligning both off-the-shelf LLMs and LLMs fine-tuned on synthetic graph data. We then compare them against existing settings on both in-domain synthetic tasks and out-of-domain real-world tasks with implicit graph structures, such as multi-hop QA, structured planning, and more. Extensive experiments demonstrate that our RL recipe leads to statistically significant improvements on 5 datasets, with an average gain of 12.9% over baseline settings. Further analysis reveals that process-based rewards consistently outperform solution-based rewards, that mixing synthetic and real-world task data yields potential gains, and that compositionality and explainable intermediate steps remain a critical challenge even after RL.
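The solution-based vs. process-based distinction can be made concrete in a toy shortest-path setting: a solution-based reward scores only the final answer, while a process-based reward gives partial credit for each valid intermediate step. Everything below is a hypothetical sketch for illustration; the function names, the adjacency-dict graph encoding, and the scoring weights are assumptions, not the paper's actual reward implementation.

```python
def solution_reward(graph, src, dst, path, optimal_len):
    """Solution-based (result-oriented): 1.0 only if the predicted path is a
    valid shortest path from src to dst, else 0.0."""
    if not path or path[0] != src or path[-1] != dst:
        return 0.0
    for u, v in zip(path, path[1:]):
        if v not in graph.get(u, ()):  # non-existent edge invalidates the path
            return 0.0
    return 1.0 if len(path) - 1 == optimal_len else 0.0


def process_reward(graph, src, dst, path):
    """Process-based: partial credit for each valid step taken from src,
    plus a bonus for reaching the goal, so coherent intermediate reasoning
    is rewarded even when the final answer is imperfect."""
    if not path or path[0] != src:
        return 0.0
    steps = list(zip(path, path[1:]))
    valid = 0
    for u, v in steps:
        if v in graph.get(u, ()):
            valid += 1
        else:
            break  # stop crediting after the first invalid step
    step_score = valid / max(len(steps), 1)
    goal_bonus = 0.5 if path[-1] == dst else 0.0
    return 0.5 * step_score + goal_bonus


# Toy directed graph: A -> {B, C}, B -> D, C -> D.
g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
solution_reward(g, "A", "D", ["A", "B", "D"], optimal_len=2)  # correct path
process_reward(g, "A", "D", ["A", "B", "C"])  # invalid 2nd step, partial credit
```

Under this kind of shaping, a trajectory with one valid hop and a wrong ending still earns a nonzero process reward but a zero solution reward, which is one intuition for why process supervision could generalize better.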