🤖 AI Summary
Existing benchmarks for evaluating large language models (LLMs) on decision-making tasks often overlook the compositional structure of actions and the explicit conditions that determine whether an action is feasible, which limits how well they capture real-world decision scenarios. This work proposes CONDESION-BENCH, a benchmark for conditional decision-making in compositional action spaces. It models actions as allocations to decision variables and imposes explicit conditions at the variable, contextual, and allocation levels. An oracle-based automated evaluation then scores both decision quality and condition adherence. Unlike benchmarks that restrict the model to a fixed set of candidate actions and omit feasibility conditions, this design enables a more rigorous and realistic evaluation of LLMs as decision-support tools in constrained environments.
📝 Abstract
Large language models have been widely explored as decision-support tools in high-stakes domains because of their contextual understanding and reasoning capabilities. However, existing decision-making benchmarks rely on two simplifying assumptions: actions are selected from a finite set of pre-defined candidates, and explicit conditions that restrict action feasibility are not incorporated into the decision-making process. These assumptions fail to capture the compositional structure of real-world actions and the explicit conditions that constrain their validity. To address these limitations, we introduce CONDESION-BENCH, a benchmark designed to evaluate conditional decision-making in compositional action spaces. In CONDESION-BENCH, actions are defined as allocations to decision variables and are restricted by explicit conditions at the variable, contextual, and allocation levels. By employing oracle-based evaluation of both decision quality and condition adherence, we provide a more rigorous assessment of LLMs as decision-support tools.
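To make the setup concrete, the following is a minimal illustrative sketch of how an action expressed as an allocation to decision variables might be checked against conditions at the three levels named above. It is not the benchmark's actual data format or evaluation code: the `Condition` class, the `violated` and `evaluate` helpers, the portfolio scenario, and the toy `quality` function standing in for an oracle score are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types: an action is an allocation of values to named decision variables.
Allocation = Dict[str, float]
Context = Dict[str, float]


@dataclass
class Condition:
    """A hypothetical explicit condition on an allocation."""
    level: str          # "variable", "contextual", or "allocation"
    description: str
    check: Callable[[Allocation, Context], bool]


def violated(allocation: Allocation, context: Context,
             conditions: List[Condition]) -> List[str]:
    """Return the descriptions of all conditions the allocation fails."""
    return [c.description for c in conditions if not c.check(allocation, context)]


def evaluate(allocation: Allocation, context: Context,
             conditions: List[Condition],
             quality: Callable[[Allocation, Context], float]) -> dict:
    """Report condition adherence and a quality score (toy stand-in for an oracle)."""
    failures = violated(allocation, context, conditions)
    return {
        "feasible": not failures,
        "violations": failures,
        "quality": quality(allocation, context),
    }


# Assumed toy scenario: split a budget across three asset classes.
context = {"budget": 100.0, "max_stock_amount": 30.0}

conditions = [
    # Variable level: restricts each decision variable on its own.
    Condition("variable", "each amount is non-negative",
              lambda a, ctx: all(v >= 0 for v in a.values())),
    # Contextual level: the limit depends on the decision context.
    Condition("contextual", "stocks do not exceed the context's stock limit",
              lambda a, ctx: a["stocks"] <= ctx["max_stock_amount"]),
    # Allocation level: restricts the allocation as a whole.
    Condition("allocation", "amounts sum to the budget",
              lambda a, ctx: abs(sum(a.values()) - ctx["budget"]) < 1e-9),
]


def quality(a: Allocation, ctx: Context) -> float:
    # Assumed toy objective: expected return with made-up rates.
    return 0.07 * a["stocks"] + 0.03 * a["bonds"] + 0.01 * a["cash"]


too_risky = {"stocks": 40.0, "bonds": 50.0, "cash": 10.0}
compliant = {"stocks": 20.0, "bonds": 60.0, "cash": 20.0}

risky_report = evaluate(too_risky, context, conditions, quality)
compliant_report = evaluate(compliant, context, conditions, quality)
print(risky_report["feasible"], risky_report["violations"])
print(compliant_report["feasible"], compliant_report["violations"])
```

In this sketch the first allocation scores higher on the toy objective but breaks the contextual condition, which is the kind of case where reporting decision quality and condition adherence separately is informative.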