🤖 AI Summary
Existing LLM tool-use evaluations focus predominantly on short-context interactions and fail to assess robustness in realistic long-horizon dialogues. Method: We introduce ToolHaystack, the first stress-testing benchmark for long-term tool-augmented dialogue, constructed from real-world API interaction traces to generate multi-task, multi-stage conversational flows. It systematically injects session-level perturbations, including tool failures, intent drift, and context dilution, enabling dynamic evaluation of long-range dependency modeling, context retention, and interference resilience. Contribution/Results: ToolHaystack supports automated, cross-model benchmarking; evaluation of 14 state-of-the-art models reveals an average performance drop of 37.2%, exposing critical deficits in long-horizon tool orchestration. This work fills a fundamental gap in long-interaction assessment and establishes a rigorous, reproducible standard for evaluating the robustness of LLM tool use.
📝 Abstract
Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task-execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. Applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.