🤖 AI Summary
Current evaluations of large language models (LLMs) on ill-defined tasks, such as complex instruction following and natural-language-to-Mermaid sequence diagram generation, suffer from insufficient coverage, sensitivity to phrasing, incomparable metrics, and instability in LLM-based judging, and therefore fail to yield reliable or diagnostic assessment signals. This work systematically analyzes these confounding failure modes, combining case studies, a categorization of failure modes, and a multidimensional evaluation framework to show that existing benchmarks often conflate distinct error types and thereby distort scores. By moving beyond monolithic aggregate metrics, the proposed approach delivers actionable, fine-grained insights and lays theoretical and practical groundwork for more robust and interpretable evaluation systems.
📝 Abstract
Many evaluations of Large Language Models (LLMs) target tasks that are inherently ill-defined, with unclear input and output spaces and ambiguous success criteria. We analyze why existing evaluation benchmarks and metrics fail to provide reliable or diagnostic signals of model capability for such tasks. We examine two case studies: Complex Instruction Following (CIF), where we identify recurring issues including limited coverage of real-world instruction complexity, sensitivity to instruction phrasing, inconsistent and non-comparable metrics, and instability introduced by LLM-based judges; and Natural Language to Mermaid Sequence Diagrams (NL2Mermaid), where we show how multi-faceted evaluation criteria can yield actionable insights beyond aggregate scores. Together, these case studies show that current evaluations frequently conflate distinct failure modes, yielding scores that are unstable, non-diagnostic, and difficult to act upon. Our findings expose fundamental limitations in existing evaluation practices for ill-defined tasks and motivate more robust, interpretable evaluation designs.
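To make the contrast between monolithic and multi-faceted scoring concrete, here is a minimal Python sketch. It is not the paper's actual framework: the dimension names, weights, and scores are hypothetical. It illustrates the core diagnostic point, namely that two outputs with unrelated failure modes can collapse to the same aggregate score while remaining fully distinguishable under per-dimension reporting.

```python
# Hypothetical sketch of multi-dimensional evaluation for an NL2Mermaid-style task.
# Dimension names and example scores are illustrative assumptions, not the paper's rubric.
from dataclasses import dataclass
from statistics import mean

@dataclass
class DiagramEval:
    syntax_valid: float           # does the Mermaid source parse? (0 or 1)
    participants: float           # fraction of required actors present
    message_order: float          # fraction of interactions in the correct order
    instruction_adherence: float  # fraction of prompt constraints honored

    def aggregate(self) -> float:
        """Single monolithic score: hides *which* dimension failed."""
        return mean([self.syntax_valid, self.participants,
                     self.message_order, self.instruction_adherence])

    def report(self) -> dict:
        """Fine-grained view: distinct failure modes stay distinguishable."""
        return {
            "syntax_valid": self.syntax_valid,
            "participants": self.participants,
            "message_order": self.message_order,
            "instruction_adherence": self.instruction_adherence,
        }

# Two outputs with the same aggregate score but unrelated failure modes:
broken_syntax = DiagramEval(0.0, 1.0, 1.0, 1.0)  # unparseable, content otherwise fine
wrong_content = DiagramEval(1.0, 1.0, 0.0, 1.0)  # parses, but misorders messages

assert broken_syntax.aggregate() == wrong_content.aggregate() == 0.75
print(broken_syntax.report())
print(wrong_content.report())
```

Under the aggregate metric these two failures are indistinguishable; the per-dimension report makes clear that one needs better syntax conformance while the other needs better semantic fidelity, which is the kind of actionable signal the abstract argues aggregate scores cannot provide.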