Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

📅 2026-01-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Small language models struggle to perform effective multi-hop reasoning under resource constraints, facing sparse exploration, sparse credit assignment, and training instability. To address these issues, this work proposes the DAVID-GRPO framework, which stabilizes early learning with minimal supervisory signals, assigns retrieval credit based on evidence recall, and improves exploration by resampling truncated near-miss trajectories, collectively introducing inductive biases tailored to small models. Evaluated on agents of up to 1.5B parameters trained on only four RTX 3090 GPUs, the approach consistently outperforms existing reinforcement learning methods designed for large-scale settings across six multi-hop question answering benchmarks. This is presented as the first demonstration of efficient multi-hop reasoning under extremely limited computational budgets, breaking the conventional trade-off between low cost and low accuracy.

📝 Abstract
While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense explorations, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated on agents up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve low training cost with high accuracy.
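The evidence-recall credit assignment described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the reward blend, and the mixing weight `alpha` are all hypothetical assumptions about how retrieval credit might be combined with answer correctness.

```python
def evidence_recall(retrieved_ids, gold_ids):
    """Fraction of gold evidence documents that appear in the retrieved set."""
    if not gold_ids:
        return 0.0
    return len(set(retrieved_ids) & set(gold_ids)) / len(gold_ids)


def trajectory_reward(answer_correct, retrieved_ids, gold_ids, alpha=0.5):
    """Blend final-answer correctness with retrieval credit.

    alpha is a hypothetical mixing weight; the point is that a truncated or
    wrong-answer trajectory that still retrieved the right evidence receives
    a partial learning signal instead of a flat zero reward.
    """
    return float(answer_correct) + alpha * evidence_recall(retrieved_ids, gold_ids)
```

Under this sketch, a near-miss trajectory that found all gold evidence but produced no final answer would still earn a nonzero reward, which is the kind of denser signal the framework uses to combat sparse credit assignment.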
Problem

Research questions and friction points this paper is trying to address.

multi-hop reasoning
resource-constrained agents
small language models
reinforcement learning
credit assignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-hop reasoning
resource-constrained agents
reinforcement learning
credit assignment
small language models