Crossing the NL/PL Divide: Information Flow Analysis Across the NL/PL Boundary in LLM-Integrated Code

📅 2026-03-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that existing program analyses struggle to bridge the semantic gap between natural language (NL) and programming language (PL), rendering dataflow analysis ineffective in code that invokes large language models (LLMs). The paper proposes the first cross-modal information flow analysis framework spanning the NL/PL boundary, introducing a taxonomy of 24 labels that characterize the degree of information preservation and the output modality between LLM inputs and outputs. Building on quantitative information flow theory, the approach combines manual annotation, Cohen's κ validation, two-phase taint propagation, and backward program slicing to make LLM call behaviors computably modelable. Evaluation on 353 expert-annotated samples achieves an F₁ score of 0.923, detects six real-world prompt injection attacks, reduces average program slice size by 15% in files containing non-propagating placeholders, and identifies four critical blocking label categories.
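The two-dimensional taxonomy and the two-phase taint propagation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the specific level and modality names, and the `llm_verifier` callback, are assumptions; only "lexically preserved", "fully blocked", and the three modality names come from the source.

```python
from enum import Enum
from dataclasses import dataclass

class Preservation(Enum):
    # Illustrative preservation levels; the paper's dimension spans
    # "lexically preserved" down to "fully blocked" (intermediate names are guesses).
    LEXICAL = "lexically_preserved"
    SEMANTIC = "semantically_preserved"
    ABSTRACTED = "abstracted"
    BLOCKED = "fully_blocked"

class Modality(Enum):
    NL = "natural_language"
    STRUCTURED = "structured_format"
    EXECUTABLE = "executable_artifact"

@dataclass(frozen=True)
class FlowLabel:
    """One point in the two-dimensional label space (preservation x modality)."""
    preservation: Preservation
    modality: Modality

def stage1_filter(label: FlowLabel) -> bool:
    """Phase 1: taxonomy-based filtering. Blocked labels are treated as
    non-propagating, so taint is dropped without any LLM call."""
    return label.preservation is not Preservation.BLOCKED

def propagates_taint(label: FlowLabel, llm_verifier) -> bool:
    """Two-phase pipeline: cheap taxonomy filter first, then LLM
    verification only for labels that might propagate."""
    if not stage1_filter(label):
        return False
    return llm_verifier(label)  # Phase 2: stand-in for an LLM-based check

# A blocked placeholder never reaches the verifier:
blocked = FlowLabel(Preservation.BLOCKED, Modality.NL)
print(propagates_taint(blocked, llm_verifier=lambda lbl: True))  # prints False
```

The design point the sketch captures is that the taxonomy acts as a static pre-filter: per-label analysis in the paper shows four blocked labels account for nearly all non-propagating cases, so most taint decisions never need the expensive verification phase.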
πŸ“ Abstract
LLM API calls are becoming a ubiquitous program construct, yet they create a boundary that no existing program analysis can cross: runtime values enter a natural-language prompt, undergo opaque processing inside the LLM, and re-emerge as code, SQL, JSON, or text that the program consumes. Every analysis that tracks data across function boundaries, including taint analysis, program slicing, dependency analysis, and change-impact analysis, relies on dataflow summaries of callee behavior. LLM calls have no such summaries, breaking all of these analyses at what we call the NL/PL boundary. We present the first information flow method to bridge this boundary. Grounded in quantitative information flow theory, our taxonomy defines 24 labels along two orthogonal dimensions: information preservation level (from lexically preserved to fully blocked) and output modality (natural language, structured format, executable artifact). We label 9,083 placeholder-output pairs from 4,154 real-world Python files and validate reliability with Cohen's κ = 0.82 and near-complete coverage (0.01% unclassifiable). We demonstrate the taxonomy's utility on two downstream applications: (1) a two-stage taint propagation pipeline combining taxonomy-based filtering with LLM verification achieves F₁ = 0.923 on 353 expert-annotated pairs, with cross-language validation on six real-world OpenClaw prompt injection cases further confirming effectiveness; (2) taxonomy-informed backward slicing reduces slice size by a mean of 15% in files containing non-propagating placeholders. Per-label analysis reveals that four blocked labels account for nearly all non-propagating cases, providing actionable filtering criteria for tool builders.
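Application (2), taxonomy-informed backward slicing, can be illustrated with a small worklist traversal over a dependence graph. This is a hedged sketch, not the paper's algorithm: the graph representation and the `non_propagating` set of LLM-call sites are assumptions introduced for illustration.

```python
from collections import defaultdict

def backward_slice(dep_edges, criterion, non_propagating):
    """Backward slice over a dependence graph, pruning dependences that
    pass through LLM calls whose label blocks information flow.

    dep_edges: iterable of (src, dst) data/control dependences between statements
    criterion: the slicing criterion (statement of interest)
    non_propagating: statements that are LLM calls with a blocked label
    """
    preds = defaultdict(set)
    for src, dst in dep_edges:
        preds[dst].add(src)

    slice_stmts, worklist = {criterion}, [criterion]
    while worklist:
        stmt = worklist.pop()
        for src in preds[stmt]:
            # Taxonomy-informed pruning: a blocked LLM call's inputs cannot
            # influence its output, so do not traverse into it.
            if src in non_propagating:
                continue
            if src not in slice_stmts:
                slice_stmts.add(src)
                worklist.append(src)
    return slice_stmts

# With dependences 1 -> 2 -> 4 and 3 -> 4, where statement 2 is an LLM call
# carrying a blocked label, slicing on 4 keeps {3, 4} instead of {1, 2, 3, 4}.
print(backward_slice([(1, 2), (2, 4), (3, 4)], 4, non_propagating={2}))
```

Pruning the blocked call removes both the call and everything reachable only through it, which is the mechanism behind the reported mean 15% slice-size reduction in files containing non-propagating placeholders.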
Problem

Research questions and friction points this paper is trying to address.

NL/PL boundary
information flow
LLM-integrated code
program analysis
dataflow tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

information flow analysis
NL/PL boundary
LLM-integrated code
taint propagation
program slicing
🔎 Similar Papers
No similar papers found.