🤖 AI Summary
To address the inefficiency and limited robustness of large language models (LLMs) on multi-step reasoning that requires tool calls, this paper proposes Chain-of-Abstraction (CoA), which decouples reasoning planning from tool execution: the model first generates an abstract reasoning chain containing symbolic placeholders, and domain tools then fill in those placeholders with specific knowledge. Because the abstract chain does not depend on tool results, the learned reasoning strategies are more robust to shifts in domain knowledge, and decoding can proceed in parallel with tool calls instead of waiting for tool responses. Experiments in mathematical reasoning and Wiki QA show an average absolute QA accuracy gain of about 6 percentage points over chain-of-thought and tool-augmented baselines on both in-distribution and out-of-distribution test sets, with inference on average about 1.4x faster than baseline tool-augmented LLMs.
📝 Abstract
To achieve faithful reasoning that aligns with human expectations, large language models (LLMs) need to ground their reasoning in real-world knowledge (e.g., web facts, math and physical rules). Tools help LLMs access this external knowledge, but there remain challenges in fine-tuning LLM agents (e.g., Toolformer) to invoke tools in multi-step reasoning problems, where interconnected tool calls require holistic and efficient tool usage planning. In this work, we propose a new method for LLMs to better leverage tools in multi-step reasoning. Our method, Chain-of-Abstraction (CoA), trains LLMs to first decode reasoning chains with abstract placeholders, and then call domain tools to reify each reasoning chain by filling in specific knowledge. This planning with abstract chains enables LLMs to learn more general reasoning strategies, which are robust to shifts in the domain knowledge (e.g., math results) relevant to different reasoning questions. It also allows LLMs to perform decoding and calling of external tools in parallel, which avoids the inference delay caused by waiting for tool responses. In mathematical reasoning and Wiki QA domains, we show that our method consistently outperforms previous chain-of-thought and tool-augmented baselines on both in-distribution and out-of-distribution test sets, with an average ~6% absolute QA accuracy improvement. LLM agents trained with our method also show more efficient tool use, with inference speed being on average ~1.4x faster than baseline tool-augmented LLMs.
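The reification step described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the bracketed step syntax `[expression = y1]`, the function names `reify` and `calculator`, and the use of a toy arithmetic evaluator as the only domain tool. It covers only the filling-in of placeholders in an already decoded chain, not model training or the overlap of decoding with tool calls.

```python
import ast
import operator
import re

# Hypothetical chain syntax: "[<expression> = <placeholder>]", e.g. "[20 + 35 = y1]".
STEP = re.compile(r"\[([^\[\]=]+)=\s*(y\d+)\s*\]")

OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str):
    """Toy stand-in for a domain tool: evaluates +, -, *, / on numbers only."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression.strip(), mode="eval").body)


def reify(abstract_chain: str) -> str:
    """Fill every placeholder in the chain with the value returned by the tool."""
    values = {}

    def fill(match):
        expression, name = match.group(1), match.group(2)
        # Substitute placeholders computed by earlier steps before calling the tool.
        for known, value in values.items():
            expression = re.sub(rf"\b{known}\b", str(value), expression)
        result = calculator(expression)
        values[name] = int(result) if float(result).is_integer() else result
        return f"{expression.strip()} = {values[name]}"

    # re.sub visits matches left to right, so earlier steps are resolved first.
    return STEP.sub(fill, abstract_chain)


chain = "Sarah has [20 + 35 = y1] pens; after giving away 10 she has [y1 - 10 = y2]."
print(reify(chain))
# Sarah has 20 + 35 = 55 pens; after giving away 10 she has 55 - 10 = 45.
```

In this sketch the chain's wording never depends on the concrete numbers, which is the property the abstract credits for robustness to shifts in domain knowledge: changing 20 and 35 changes only what the tool returns, not the plan.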