🤖 AI Summary
This study addresses the limited ability of current large language models to handle ambiguous or contradictory user requests in scientific settings, where problems are often ill-posed at the outset. The authors introduce SciConvBench, a multi-turn clarification benchmark for computational science that covers fluid mechanics, solid mechanics, materials science, and partial differential equations, and they pair a structured task ontology with a rubric-based evaluation framework. The framework assesses upstream conversational reasoning along three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Experiments show that frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics; models also frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with the user.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.