🤖 AI Summary
This work addresses the latency that multi-hop retrieval agents incur when they must repeatedly wait on external tool calls. The authors propose SpecHop, a continuous speculation framework that accelerates inference without changing the trajectory the model would have taken without acceleration. SpecHop runs multiple speculative threads on observations predicted by faster but less reliable speculator tools, verifies those predictions asynchronously as the target tool outputs arrive, commits correct branches, and rolls back incorrect ones, so the speedup is lossless. The study also develops a theoretical framework for lossless speculation in multi-hop tool use, characterizes the optimal achievable latency gain, and shows that SpecHop can approach oracle latency gains given enough active threads. On retrieval-augmented multi-hop tasks, SpecHop preserves the original accuracy, closely matches the theoretical predictions, and reduces latency by up to 40% in some settings.
📝 Abstract
Large language models increasingly use external tools such as web search and document retrieval to solve information-intensive tasks. However, multi-hop tool use in complex tasks introduces substantial latency, since the model must repeatedly wait for tool observations before continuing. We study how to accelerate such trajectories without changing the final trajectory the model would have taken without acceleration, assuming access to faster but less reliable speculator tools. We develop a theoretical framework for lossless speculation in multi-hop tool-use settings, characterizing the optimal achievable latency gain. We propose SpecHop, a continuous speculation framework that maintains multiple speculative threads, verifies predicted observations asynchronously as target tool outputs arrive, commits correct branches, and rolls back incorrect ones. This preserves accuracy while reducing wall-clock latency. We show that SpecHop can approach oracle latency gains with enough active threads. Empirically, on retrieval-augmented multi-hop tasks, SpecHop closely matches theoretical predictions and reduces latency by up to 40% in some settings. Code: https://github.com/mehrdadsaberi/spechop
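The speculate, verify, commit, and roll back loop described in the abstract can be illustrated with a small simulation. The sketch below is not the SpecHop implementation from the linked repository. Every name in it (`next_query`, `target_tool`, `speculator_tool`, `run_speculative`, `max_threads`) and both latencies are hypothetical stand-ins, and it uses a simplified windowed scheme that launches new speculative calls only after the current window has been verified, rather than refilling threads continuously.

```python
import asyncio

TARGET_DELAY = 0.05       # hypothetical latency of the slow, authoritative tool
SPECULATOR_DELAY = 0.005  # hypothetical latency of the fast, less reliable tool


def next_query(observations):
    """Stand-in for the model step: the next query depends on the last observation."""
    return (observations[-1] if observations else 0) + 1


async def target_tool(query):
    """Toy target tool: slow, and treated as ground truth."""
    await asyncio.sleep(TARGET_DELAY)
    return query * 2


async def speculator_tool(query):
    """Toy speculator: fast, and deliberately wrong for one query."""
    await asyncio.sleep(SPECULATOR_DELAY)
    return -1 if query == 7 else query * 2


async def run_sequential(num_hops):
    """Baseline: wait for the target tool at every hop."""
    observations = []
    for _ in range(num_hops):
        observations.append(await target_tool(next_query(observations)))
    return observations


async def run_speculative(num_hops, max_threads=4):
    """Speculate ahead on guessed observations; commit only verified ones."""
    committed = []
    while len(committed) < num_hops:
        assumed = list(committed)  # committed prefix plus unverified guesses
        pending = []               # (guess, in-flight target call) per hop
        while len(assumed) < num_hops and len(pending) < max_threads:
            query = next_query(assumed)
            # Start the slow call now; it runs while we keep speculating.
            task = asyncio.create_task(target_tool(query))
            guess = await speculator_tool(query)
            pending.append((guess, task))
            assumed.append(guess)
        for i, (guess, task) in enumerate(pending):
            truth = await task
            committed.append(truth)  # the target output is always what is kept
            if truth != guess:
                # Roll back: later hops were built on a wrong observation.
                for _, later in pending[i + 1:]:
                    later.cancel()
                break
    return committed


if __name__ == "__main__":
    print(asyncio.run(run_sequential(5)))
    print(asyncio.run(run_speculative(5)))
```

Because only target-tool outputs are ever committed, and any hop built on a mismatched guess is discarded and recomputed, the speculative run returns the same observations as the sequential baseline. The saving comes from target calls that overlap in time whenever the speculator guesses correctly.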