Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing retrieval-augmented methods: they rely on lexical or embedding similarity and therefore often fail to retrieve the specific knowledge that multi-step data reasoning tasks require. The authors propose SGKR (Structure-Grounded Knowledge Retrieval), a framework that organizes and retrieves domain knowledge using a graph induced by function-call dependencies. Given a question, SGKR extracts semantic input and output tags, identifies the dependency paths connecting them, constructs a task-relevant subgraph, and assembles the associated knowledge and function implementations into a structured context for LLM-based code generation. On multi-step data analysis benchmarks, SGKR consistently improves solution correctness over both no-retrieval and similarity-based retrieval baselines, for vanilla LLMs and coding agents alike.

📝 Abstract

Selecting the right knowledge is critical when using large language models (LLMs) to solve domain-specific data analysis tasks. However, most retrieval-augmented approaches rely primarily on lexical or embedding similarity, which is often a weak proxy for the task-critical knowledge needed for multi-step reasoning. In many such tasks, the relevant knowledge is not merely textually related to the query, but is instead grounded in executable code and the dependency structure through which computations are carried out. To address this mismatch, we propose SGKR (Structure-Grounded Knowledge Retrieval), a retrieval framework that organizes domain knowledge with a graph induced by function-call dependencies. Given a question, SGKR extracts semantic input and output tags, identifies dependency paths connecting them, and constructs a task-relevant subgraph. The associated knowledge and corresponding function implementations are then assembled as a structured context for LLM-based code generation. Experiments on multi-step data analysis benchmarks show that SGKR consistently improves solution correctness over no-retrieval and similarity-based retrieval baselines for both vanilla LLMs and coding agents.
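The retrieval steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the call graph `CALLS`, the per-function tag annotations `TAGS`, the notes in `KNOWLEDGE`, and all function names are hypothetical, and the sketch assumes that a function is task-relevant when it lies on a call-dependency path between a function matching an output tag and a function matching an input tag.

```python
from collections import deque

# Hypothetical call graph: caller -> callees (all names are illustrative).
CALLS = {
    "monthly_revenue_growth": ["monthly_revenue"],
    "monthly_revenue": ["load_orders", "convert_currency"],
    "convert_currency": ["load_fx_rates"],
    "customer_churn": ["load_customers"],
}

# Hypothetical semantic tags attached to each function.
TAGS = {
    "load_orders": {"inputs": {"orders_table"}, "outputs": {"orders"}},
    "load_fx_rates": {"inputs": {"fx_table"}, "outputs": {"fx_rates"}},
    "load_customers": {"inputs": {"customers_table"}, "outputs": {"customers"}},
    "convert_currency": {"inputs": {"fx_rates"}, "outputs": {"amount_usd"}},
    "monthly_revenue": {"inputs": {"orders"}, "outputs": {"revenue"}},
    "monthly_revenue_growth": {"inputs": {"revenue"}, "outputs": {"revenue_growth"}},
    "customer_churn": {"inputs": {"customers"}, "outputs": {"churn_rate"}},
}

# Hypothetical domain notes associated with each function.
KNOWLEDGE = {name: f"Domain note for {name}." for name in TAGS}


def reachable(starts, edges):
    """Return every node reachable from `starts` by following `edges` (BFS)."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(edges):
    """Flip every caller -> callee edge into callee -> caller."""
    rev = {}
    for src, dsts in edges.items():
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev


def retrieve_subgraph(calls, tags, in_tags, out_tags):
    """Keep functions lying on a call path from an output match to an input match."""
    sources = {f for f, t in tags.items() if t["inputs"] & in_tags}
    targets = {f for f, t in tags.items() if t["outputs"] & out_tags}
    below_targets = reachable(targets, calls)           # what the targets depend on
    above_sources = reachable(sources, reverse(calls))  # what depends on the sources
    return below_targets & above_sources


def build_context(nodes, knowledge):
    """Assemble the retrieved notes into one structured prompt context."""
    return "\n\n".join(f"# {name}\n{knowledge[name]}" for name in sorted(nodes))


subgraph = retrieve_subgraph(CALLS, TAGS, {"orders_table"}, {"revenue_growth"})
context = build_context(subgraph, KNOWLEDGE)
print(context)
```

In this toy example, a question tagged with input `orders_table` and output `revenue_growth` retrieves only the three functions on the connecting call path; `customer_churn` and the currency helpers are left out even though a similarity-based retriever might rank them highly for a revenue-related query. A fuller version would also attach each function's source code to the context, as the abstract describes.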

Problem

Research questions and friction points this paper is trying to address.

knowledge retrieval

multi-step reasoning

code dependencies

structure-grounded

domain-specific data analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structure-Grounded Retrieval

Code Dependency Graph

Multi-Step Reasoning