🤖 AI Summary
Large language model (LLM) agents pose significant security risks, particularly sensitive-data leakage, when they directly generate and execute code for data analysis. Method: We propose a security-motivated alternative in which LLMs never observe raw sensitive data or generate executable code; instead, they access data exclusively through a predefined set of secure, verified tools. To evaluate this secure, multi-step reasoning setting, we construct InData (Indirect Data Engagement), a benchmark of data-analysis questions at three difficulty levels that require compositional tool invocation under strict security constraints. Contribution/Results: Benchmarking 15 open-source LLMs, we find that the strongest model (gpt-oss-120b) reaches 97.3% accuracy on Easy tasks but drops sharply to 69.6% on Hard multi-step tasks, exposing a critical bottleneck in current models' long-horizon reasoning under security restrictions. InData is the first benchmark explicitly targeting secure, multi-step, tool-integrated reasoning, filling a key gap in safety-aware LLM evaluation.
📝 Abstract
Large language model agents for data analysis typically generate and execute code directly on databases. When applied to sensitive data, however, this approach poses significant security risks. To address this issue, we propose a security-motivated alternative: restrict LLMs from direct code generation and data access, and require them to interact with data exclusively through a predefined set of secure, verified tools. Although recent tool-use benchmarks exist, they primarily target tool selection and simple execution rather than the compositional, multi-step reasoning needed for complex data analysis. To close this gap, we introduce Indirect Data Engagement (InData), a dataset designed to assess LLMs' multi-step tool-based reasoning ability. InData includes data analysis questions at three difficulty levels--Easy, Medium, and Hard--capturing increasing reasoning complexity. We benchmark 15 open-source LLMs on InData and find that while large models (e.g., gpt-oss-120b) achieve high accuracy on Easy tasks (97.3%), performance drops sharply on Hard tasks (69.6%). These results show that current LLMs still lack robust multi-step tool-based reasoning ability. With InData, we take a step toward enabling the development and evaluation of LLMs with stronger multi-step tool-use capabilities. We will publicly release the dataset and code.
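The core idea of tool-mediated data access can be sketched in a few lines. This is a minimal illustration, not the paper's actual API: all tool names (`tool_count`, `tool_mean_salary`, `dispatch`) and the toy table are hypothetical. The agent proposes structured tool calls, a dispatcher executes only whitelisted tools, and the agent ever sees only aggregate results, never raw rows and never its own generated code running against the data.

```python
from typing import Any

# Toy "sensitive" table; under this scheme the agent can never read it directly.
TABLE = [
    {"name": "alice", "dept": "eng", "salary": 120},
    {"name": "bob",   "dept": "eng", "salary": 100},
    {"name": "carol", "dept": "ops", "salary": 90},
]

def tool_count(dept: str) -> int:
    """Safe tool: returns only an aggregate count, never raw records."""
    return sum(1 for r in TABLE if r["dept"] == dept)

def tool_mean_salary(dept: str) -> float:
    """Safe tool: returns only an aggregate statistic."""
    vals = [r["salary"] for r in TABLE if r["dept"] == dept]
    return sum(vals) / len(vals) if vals else 0.0

# The predefined, security-vetted toolset the agent is allowed to invoke.
ALLOWED_TOOLS = {"count": tool_count, "mean_salary": tool_mean_salary}

def dispatch(call: dict[str, Any]):
    """Execute an agent-proposed tool call only if the tool is whitelisted."""
    fn = ALLOWED_TOOLS.get(call["tool"])
    if fn is None:
        raise PermissionError(f"tool {call['tool']!r} not allowed")
    return fn(**call["args"])

# Multi-step reasoning becomes a composition of tool calls:
n = dispatch({"tool": "count", "args": {"dept": "eng"}})          # -> 2
avg = dispatch({"tool": "mean_salary", "args": {"dept": "eng"}})  # -> 110.0
```

Harder InData-style questions would chain many such calls, which is exactly the long-horizon composition the benchmark finds current models struggle with.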