IntenTest: Stress Testing for Intent Integrity in API-Calling LLM Agents

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the problem of LLM agents deviating from user goals due to intent misinterpretation when executing natural language–driven API calls, especially amid continuous API evolution, this paper proposes the first semantic-partitioning–guided stress-testing framework for intent integrity. Methodologically, it integrates targeted mutation grounded in semantic partitioning and intent preservation, a datatype-aware strategy memory, and a lightweight error-proneness predictor, enabling generalizable testing across diverse LLMs and API versions. Key contributions include: (i) the first semantic-partitioning–based mutation mechanism; (ii) a datatype-aware strategy-memory scheme; and (iii) an efficient error-proneness predictor. Evaluated on 80 API toolkits, the framework achieves up to a 3.2× improvement in error-exposure rate and a 2.8× gain in query efficiency. Notably, it enables small models to generate test cases that remain effective against larger target models.
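The core idea of semantic partitioning can be illustrated as enumerating combinations of equivalence classes over an API's parameters. The sketch below is a minimal, hypothetical formulation (the parameter names and class labels are invented for illustration; the paper's actual partitioning is derived from toolkit documentation):

```python
from itertools import product

def semantic_partitions(param_classes):
    """Enumerate semantic partitions as the cross-product of each
    API parameter's equivalence classes (illustrative sketch)."""
    names = sorted(param_classes)
    for combo in product(*(param_classes[n] for n in names)):
        yield dict(zip(names, combo))

# Hypothetical equivalence classes for a paper-search API's parameters.
classes = {
    "date_range": ["absolute", "relative", "open-ended"],
    "max_results": ["default", "explicit-small", "explicit-large"],
}
partitions = list(semantic_partitions(classes))
print(len(partitions))  # 3 x 3 = 9 partitions
```

Each partition then serves as a bucket from which seed tasks are drawn and mutated, so coverage can be tracked per semantic category rather than per raw input string.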

📝 Abstract
LLM agents are increasingly deployed to automate real-world tasks by invoking APIs through natural language instructions. While powerful, they often suffer from misinterpretation of user intent, leading to agent actions that diverge from the user's intended goal, especially as external toolkits evolve. Traditional software testing assumes structured inputs and thus falls short in handling the ambiguity of natural language. We introduce IntenTest, an API-centric stress testing framework that systematically uncovers intent integrity violations in LLM agents. Unlike prior work focused on fixed benchmarks or adversarial inputs, IntenTest generates realistic tasks based on toolkits' documentation and applies targeted mutations to expose subtle agent errors while preserving user intent. To guide testing, we propose semantic partitioning, which organizes natural language tasks into meaningful categories based on toolkit API parameters and their equivalence classes. Within each partition, seed tasks are mutated and ranked by a lightweight predictor that estimates the likelihood of triggering agent errors. To enhance efficiency, IntenTest maintains a datatype-aware strategy memory that retrieves and adapts effective mutation patterns from past cases. Experiments on 80 toolkit APIs demonstrate that IntenTest effectively uncovers intent integrity violations, significantly outperforming baselines in both error-exposing rate and query efficiency. Moreover, IntenTest generalizes well to stronger target models using smaller LLMs for test generation, and adapts to evolving APIs across domains.
Problem

Research questions and friction points this paper is trying to address.

Detects intent misinterpretation in API-calling LLM agents
Generates realistic tasks to expose subtle agent errors
Improves testing efficiency with semantic partitioning and mutation ranking
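The mutation-ranking step described above can be sketched as scoring candidate mutants with a lightweight error-proneness predictor and querying the agent in score order. The scoring heuristic below is purely an assumption for illustration (the paper trains an actual predictor); the task strings and the `max_results` parameter are hypothetical:

```python
def rank_mutants(mutants, score):
    """Rank candidate task mutations by predicted error-proneness,
    highest score first (stand-in for the paper's lightweight predictor)."""
    return sorted(mutants, key=score, reverse=True)

# Toy heuristic (assumption): instructions that paraphrase away an explicit
# parameter mention are more likely to trigger intent misinterpretation.
def toy_score(task):
    penalty = 0 if "max_results" in task else 2
    return penalty + len(task) / 100

seeds = [
    "search papers since 2023, max_results=5",
    "find recent papers, no more than a handful",
]
ranked = rank_mutants(seeds, toy_score)
print(ranked[0])  # the paraphrased, parameter-free mutant ranks first
```

Ranking before querying is what buys query efficiency: the agent is only exercised on the mutants most likely to expose an error, instead of exhaustively.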
Innovation

Methods, ideas, or system contributions that make the work stand out.

API-centric stress testing framework
Semantic partitioning for task categorization
Datatype-aware strategy memory adaptation
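The datatype-aware strategy memory can be pictured as a store keyed by parameter datatype, from which previously effective mutation strategies are retrieved for reuse. This is a minimal sketch under assumed semantics (the class name, type labels, and fallback behavior are invented; the paper's retrieval and adaptation are richer):

```python
from collections import defaultdict

class StrategyMemory:
    """Datatype-keyed memory of mutation strategies that exposed
    errors in past test cases (illustrative sketch)."""
    def __init__(self):
        self._store = defaultdict(list)

    def record(self, datatype, strategy):
        if strategy not in self._store[datatype]:
            self._store[datatype].append(strategy)

    def retrieve(self, datatype):
        # Fall back to generic strategies when the datatype is unseen.
        return self._store.get(datatype) or self._store.get("generic", [])

mem = StrategyMemory()
mem.record("date", "swap absolute date for relative phrase")
mem.record("generic", "paraphrase the parameter mention")
print(mem.retrieve("date"))     # type-specific strategies
print(mem.retrieve("integer"))  # unseen type -> generic fallback
```

Keying on datatype rather than on a specific API is what lets strategies transfer across toolkits and API versions: a mutation that confused an agent on one date parameter is a plausible candidate for any other date parameter.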