🤖 AI Summary
This work systematically investigates how ambiguous task framing affects data science agents, a failure mode that existing benchmarks overlook: when the prediction target or evaluation metric is underspecified, agents silently adopt an incorrect assumption and produce outputs misaligned with user intent. The authors introduce two diagnostic benchmarks, one for ambiguity in the prediction target (built on DSBench) and one for ambiguity in the evaluation objective (built on MLE-bench). Each ambiguous variant is produced by controlled edits to a fully specified task and is verified by humans and a large language model to admit multiple plausible interpretations. Experiments across five agents show that ambiguity lowers performance in both suites, and that allowing a single clarifying question recovers much of the loss under idealized conditions. However, agents cannot reliably tell when clarification is needed: depending on the prompt, they either default silently on ambiguous tasks or over-ask on clear ones. The study identifies silent misinterpretation as a critical bottleneck and quantifies both the promise and the limitations of clarification mechanisms in mitigating framing ambiguity.
📝 Abstract
As data-science agents shift from co-pilots to auto-pilots, silent misframing becomes a critical failure mode. Agents quietly commit to plausible but unintended task framings, producing clean, executable artifacts that hide their incorrect assessment of the task. Existing benchmarks score whether the pipeline runs, ignoring whether the agent recognized the task was underspecified. We introduce Ambig-DS, two diagnostic suites: one for prediction-target ambiguity (Ambig-DS-Target, 51 tasks built on DSBench, a tabular modeling benchmark) and one for evaluation-objective ambiguity (Ambig-DS-Objective, 61 tasks built on MLE-bench, a Kaggle-style ML competition benchmark), constructed so that scoring uses each source benchmark's original evaluator. For every task we pair the original, fully specified version with an ambiguous variant produced by controlled edits; a human-and-LLM verification pipeline confirms each variant admits multiple plausible interpretations with decision-relevant consequences. The suites are analyzed independently and ambiguity lowers performance in both. Across five agents spanning efficient to frontier-class models, we find in our controlled diagnostic setting: (i) failures are silent commitments: wrong-target submissions on Target, wrong-metric or non-committal baseline submissions on Objective, rather than execution errors; (ii) allowing the agent to ask one clarifying question recovers much of the loss under idealized conditions, suggesting missing framing information drives a substantial part of the observed degradation; but (iii) agents cannot reliably tell when to use it: permissive prompts induce over-asking on clear tasks, while conservative prompts induce silent defaulting on ambiguous ones. Recognizing target and objective underspecification, not pipeline execution, is the bottleneck missing from standard DS-agent evaluations.
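Finding (iii) contrasts two error rates: asking on clear tasks and staying silent on ambiguous ones. The following is a minimal sketch of how such rates could be computed from paired runs. It is an illustration only, not the authors' evaluation code; the `Run` record, its field names, the `ask_rates` helper, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One agent run on one task variant (hypothetical record layout)."""
    task_id: str
    ambiguous: bool  # True for the ambiguous variant, False for the original
    asked: bool      # True if the agent used its single clarifying question


def ask_rates(runs):
    """Return (over_ask_rate, silent_default_rate).

    over_ask_rate: share of clear-variant runs in which the agent asked.
    silent_default_rate: share of ambiguous-variant runs in which it did not.
    Both definitions are assumptions made for this sketch.
    """
    clear = [r for r in runs if not r.ambiguous]
    ambig = [r for r in runs if r.ambiguous]
    over_ask = sum(r.asked for r in clear) / len(clear) if clear else 0.0
    silent = sum(not r.asked for r in ambig) / len(ambig) if ambig else 0.0
    return over_ask, silent


# Toy data: four tasks, each run once in its clear and once in its ambiguous form.
runs = [
    Run("t1", ambiguous=False, asked=True),
    Run("t1", ambiguous=True, asked=True),
    Run("t2", ambiguous=False, asked=False),
    Run("t2", ambiguous=True, asked=False),
    Run("t3", ambiguous=False, asked=False),
    Run("t3", ambiguous=True, asked=True),
    Run("t4", ambiguous=False, asked=False),
    Run("t4", ambiguous=True, asked=False),
]

over_ask, silent = ask_rates(runs)
print(f"over-ask rate on clear tasks: {over_ask:.2f}")
print(f"silent-default rate on ambiguous tasks: {silent:.2f}")
```

Reporting the two rates side by side, rather than a single ask rate, makes the trade-off visible: a more permissive prompt would be expected to lower the second number while raising the first.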