🤖 AI Summary
This work addresses runtime safety evaluation of skill invocations by autonomous language-model agents, where static auditing cannot capture context-dependent risk. The paper proposes STARS, a request-conditioned auditing framework that combines a static capability prior, a request-conditioned invocation risk model, and a calibrated risk-fusion policy to predict a continuous risk score before invocation, supporting ranking and triage ahead of any hard intervention. The authors also contribute SIA-Bench, a benchmark of 3,000 invocation records with group-safe splits, lineage metadata, runtime context, canonical action labels, and derived continuous-risk targets. On a held-out split of indirect prompt injection attacks, calibrated fusion reaches a high-risk AUPRC of 0.439, compared with 0.405 for the contextual scorer and 0.380 for the strongest static baseline; on the in-distribution test split the gains are smaller and static priors remain useful.
📝 Abstract
Autonomous language-model agents increasingly rely on installable skills and tools to complete user tasks. Static skill auditing can expose capability surface before deployment, but it cannot determine whether a particular invocation is unsafe under the current user request and runtime context. We therefore study skill invocation auditing as a continuous-risk estimation problem: given a user request, candidate skill, and runtime context, predict a score that supports ranking and triage before a hard intervention is applied. We introduce STARS, which combines a static capability prior, a request-conditioned invocation risk model, and a calibrated risk-fusion policy. To evaluate this setting, we construct SIA-Bench, a benchmark of 3,000 invocation records with group-safe splits, lineage metadata, runtime context, canonical action labels, and derived continuous-risk targets. On a held-out split of indirect prompt injection attacks, calibrated fusion reaches 0.439 high-risk AUPRC, improving over 0.405 for the contextual scorer and 0.380 for the strongest static baseline, while the contextual scorer remains better calibrated with 0.289 expected calibration error. On the locked in-distribution test split, gains are smaller and static priors remain useful. The resulting claim is therefore narrower: request-conditioned auditing is most valuable as an invocation-time risk-scoring and triage layer rather than as a replacement for static screening. Code is available at https://github.com/123zgj123/STARS.
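To make the scoring pipeline described above concrete, here is a minimal Python sketch. It is not the STARS implementation. It assumes the fusion policy is a weighted sum of the two scores in logit space (the abstract does not specify the form), that continuous-risk targets are binarized into a high-risk label for evaluation, and that expected calibration error uses ten equal-width bins. All function names and default weights are hypothetical.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def fuse_risk(static_prior, context_score, w_static=0.4, w_context=0.6, bias=0.0):
    """Hypothetical fusion: weighted sum in logit space, mapped back to [0, 1].

    In practice the weights and bias would be fit on a validation split;
    the defaults here are placeholders, not values from the paper.
    """
    z = w_static * logit(static_prior) + w_context * logit(context_score) + bias
    return sigmoid(z)


def average_precision(scores, labels):
    """Average precision, a common estimator of AUPRC.

    Ranks records by descending score and averages the precision
    observed at each positive (high-risk) record.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def expected_calibration_error(scores, labels, n_bins=10):
    """Binned ECE: size-weighted mean of |observed rate - mean score| per bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    n = len(scores)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_score = sum(s for s, _ in b) / len(b)
        observed = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(observed - mean_score)
    return ece
```

The two metrics measure different things, which is why a fused score can rank high-risk invocations better (higher AUPRC) while a single scorer remains better calibrated (lower ECE): average precision depends only on the ordering of scores, whereas ECE depends on their absolute values.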