🤖 AI Summary
This work addresses the challenge of improving large language models' (LLMs) generalization from synthetic graph data to real-world tasks with implicit graph structure, e.g., multi-hop question answering and structured planning. Unlike conventional supervised fine-tuning, we propose the first reinforcement learning (RL)-based framework for implicit graph reasoning generalization. Our method introduces a dual reward mechanism, combining process-oriented and result-oriented rewards, to mitigate overfitting and enhance cross-domain adaptability, and it integrates alignment algorithms (e.g., GRPO, DPO) with solution-space modeling and process supervision, enabling plug-and-play deployment for both pretrained and synthetic-data-fine-tuned LLMs. Evaluated on five benchmarks, our approach achieves an average improvement of 12.9% (p < 0.01), with process rewards consistently outperforming result rewards. Hybrid training on synthetic and real data further boosts performance. Key contributions: (1) the first RL training paradigm tailored to implicit graph reasoning generalization, and (2) an interpretable, robust process supervision mechanism.
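Since the summary mentions GRPO among the alignment algorithms, the group-relative advantage computation at the core of GRPO-style training can be sketched as follows. This is a minimal illustration only: the plain-list interface and the `1e-8` stabilizer are assumptions for the sketch, not details taken from the paper.

```python
def grpo_advantages(group_rewards):
    """Normalize the rewards of a group of sampled responses to the same
    prompt, yielding group-relative advantages as used in GRPO-style RL.

    Each response's advantage is its reward minus the group mean, divided
    by the group standard deviation (with a small epsilon for stability).
    """
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in group_rewards]


# Example: two good and two bad responses in a group of four.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Above-average responses get positive advantages, below-average negative.
```

In practice these advantages weight the policy-gradient update for each sampled response, so no separate value network is needed.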
📝 Abstract
Previous research has sought to enhance the graph reasoning capabilities of LLMs by supervised fine-tuning on synthetic graph data. While this produced specialized LLMs better at solving graph algorithm problems, we don't need LLMs for shortest path: we need generalization from synthetic graph data to real-world tasks with implicit graph structures. In this work, we propose to unlock generalizable learning from synthetic graph data with reinforcement learning. We first design solution-based and process-based rewards for synthetic graph problems: instead of the rigid memorization of response patterns induced by direct fine-tuning, we posit that RL helps LLMs grasp the essentials underlying graph reasoning and alleviates overfitting. We employ RL algorithms such as GRPO and DPO, aligning both off-the-shelf LLMs and LLMs fine-tuned on synthetic graph data. We then compare them against existing settings on both in-domain synthetic tasks and out-of-domain real-world tasks with implicit graph structures, such as multi-hop QA, structured planning, and more. Extensive experiments demonstrate that our RL recipe leads to statistically significant improvements on 5 datasets, with an average gain of 12.9% over baseline settings. Further analysis reveals that process-based rewards consistently outperform solution-based rewards, that mixing synthetic and real-world task data yields potential gains, and that compositionality and explainable intermediate steps remain a critical challenge even after RL.
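The solution-based vs. process-based distinction can be made concrete in a toy shortest-path setting: a solution-based reward scores only the final answer, while a process-based reward gives partial credit for each valid intermediate step. Everything below is a hypothetical sketch for illustration; the function names, the adjacency-dict graph encoding, and the scoring weights are assumptions, not the paper's actual reward implementation.

```python
def solution_reward(graph, src, dst, path, optimal_len):
    """Solution-based (result-oriented): 1.0 only if the predicted path is a
    valid shortest path from src to dst, else 0.0."""
    if not path or path[0] != src or path[-1] != dst:
        return 0.0
    for u, v in zip(path, path[1:]):
        if v not in graph.get(u, ()):  # non-existent edge invalidates the path
            return 0.0
    return 1.0 if len(path) - 1 == optimal_len else 0.0


def process_reward(graph, src, dst, path):
    """Process-based: partial credit for each valid step taken from src,
    plus a bonus for reaching the goal, so coherent intermediate reasoning
    is rewarded even when the final answer is imperfect."""
    if not path or path[0] != src:
        return 0.0
    steps = list(zip(path, path[1:]))
    valid = 0
    for u, v in steps:
        if v in graph.get(u, ()):
            valid += 1
        else:
            break  # stop crediting after the first invalid step
    step_score = valid / max(len(steps), 1)
    goal_bonus = 0.5 if path[-1] == dst else 0.0
    return 0.5 * step_score + goal_bonus


# Toy directed graph: A -> {B, C}, B -> D, C -> D.
g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
solution_reward(g, "A", "D", ["A", "B", "D"], optimal_len=2)  # correct path
process_reward(g, "A", "D", ["A", "B", "C"])  # invalid 2nd step, partial credit
```

Under this kind of shaping, a trajectory with one valid hop and a wrong ending still earns a nonzero process reward but a zero solution reward, which is one intuition for why process supervision could generalize better.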