InData: Towards Secure Multi-Step, Tool-Based Data Analysis

📅 2025-11-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language model (LLM) agents pose significant security risks, particularly sensitive-data leakage, when they directly generate and execute code for data analysis. Method: The paper proposes a security-motivated alternative in which LLMs are barred from direct code generation and raw-data access and must instead interact with data exclusively through a predefined set of secure, verified tools. To evaluate this setting, the authors introduce InData (Indirect Data Engagement), a benchmark of multi-step, tool-based data analysis questions at three difficulty levels (Easy, Medium, Hard) that stress compositional tool invocation under strict security constraints. Contribution/Results: Across 15 open-source LLMs, large models such as gpt-oss-120b reach 97.3% accuracy on Easy tasks but drop sharply to 69.6% on Hard tasks, exposing a bottleneck in long-horizon, multi-step tool-based reasoning under security restrictions. Unlike prior tool-use benchmarks that focus on tool selection and simple execution, InData explicitly targets secure, multi-step, tool-integrated reasoning.

📝 Abstract
Large language model agents for data analysis typically generate and execute code directly on databases. However, when applied to sensitive data, this approach poses significant security risks. To address this issue, we propose a security-motivated alternative: restrict LLMs from direct code generation and data access, and require them to interact with data exclusively through a predefined set of secure, verified tools. Although recent tool-use benchmarks exist, they primarily target tool selection and simple execution rather than the compositional, multi-step reasoning needed for complex data analysis. To reduce this gap, we introduce Indirect Data Engagement (InData), a dataset designed to assess LLMs' multi-step tool-based reasoning ability. InData includes data analysis questions at three difficulty levels--Easy, Medium, and Hard--capturing increasing reasoning complexity. We benchmark 15 open-source LLMs on InData and find that while large models (e.g., gpt-oss-120b) achieve high accuracy on Easy tasks (97.3%), performance drops sharply on Hard tasks (69.6%). These results show that current LLMs still lack robust multi-step tool-based reasoning ability. With InData, we take a step toward enabling the development and evaluation of LLMs with stronger multi-step tool-use capabilities. We will publicly release the dataset and code.
Problem

Research questions and friction points this paper is trying to address.

Addressing security risks in LLM data analysis by restricting direct data access
Evaluating multi-step tool-based reasoning for complex data analysis tasks
Benchmarking LLMs' compositional reasoning through difficulty-graded analysis questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Restricts LLMs from direct data access and code generation
Mediates all data interaction through predefined, secure, verified tools
Introduces InData, a three-tier benchmark for multi-step tool-based reasoning
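The tool-restricted access pattern described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; all names (`TOOL_REGISTRY`, `execute_tool_call`, the toy table) are hypothetical:

```python
# Minimal sketch of tool-mediated data access (illustrative names only):
# the model never sees raw rows and never executes generated code; it may
# only invoke whitelisted tools, which return aggregates computed server-side.
from typing import Callable, Dict

SENSITIVE_DB = {"salaries": [52000, 61000, 58000, 47000]}  # never exposed to the LLM

def tool_count(table: str) -> int:
    """Return a table's row count without revealing its contents."""
    return len(SENSITIVE_DB[table])

def tool_mean(table: str) -> float:
    """Return an aggregate; individual values stay behind the tool boundary."""
    rows = SENSITIVE_DB[table]
    return sum(rows) / len(rows)

TOOL_REGISTRY: Dict[str, Callable] = {"count": tool_count, "mean": tool_mean}

def execute_tool_call(name: str, **kwargs):
    """Gatekeeper: reject any call outside the predefined, verified toolchain."""
    if name not in TOOL_REGISTRY:
        raise PermissionError(f"Tool '{name}' is not in the allowed set")
    return TOOL_REGISTRY[name](**kwargs)

# A multi-step plan emitted by the model is just a sequence of tool calls,
# composed without generating or running arbitrary code:
n = execute_tool_call("count", table="salaries")
avg = execute_tool_call("mean", table="salaries")
print(n, avg)  # → 4 54500.0
```

The harder InData tasks correspond to longer such chains, where intermediate tool outputs must be composed correctly across many steps.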