🤖 AI Summary
This study addresses the challenges that make evaluating large language model (LLM) applications resistant to traditional software testing: output stochasticity, high dimensionality, and sensitivity to prompt and model changes. The authors propose an evaluation-driven engineering workflow (Define, Test, Diagnose, Fix) and introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for general-purpose LLM applications, retrieval-augmented generation (RAG), and agentic tool-use workflows. The framework combines automated checks, human rubric scoring, and LLM-as-judge into a reproducible local evaluation setup, validated on the Ollama platform with Llama 3 8B Instruct and Qwen 2.5 7B Instruct. Experiments show that a generic "improved" prompt template trades off behaviors: for Llama 3 it improves instruction following but lowers structured-extraction pass rate from 100% to 90% and RAG compliance from 93.3% to 80%, underscoring the need for evaluation-driven iteration and systematic assessment over heuristic prompt engineering.
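To make one of the summarized evaluation methods concrete, the sketch below shows how an LLM-as-judge check could be wired against a local Ollama model. The rubric, prompt wording, model tag, and endpoint are illustrative assumptions, not the paper's released harness.

```python
"""Minimal LLM-as-judge sketch, assuming a local Ollama server at its
default endpoint. The rubric and judge prompt are hypothetical examples."""
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
JUDGE_MODEL = "qwen2.5:7b-instruct"  # assumed tag; the judge may differ from the model under test

JUDGE_PROMPT = """You are grading a model answer against a rubric.
Question: {question}
Answer: {answer}
Rubric: the answer must (1) address the question, (2) use only the provided
information, and (3) stay under 100 words.
Reply with a single integer score from 1 (fails the rubric) to 5 (fully satisfies it)."""


def judge(question: str, answer: str) -> int:
    """Ask the judge model for a 1-5 rubric score (raises if the reply is not an integer)."""
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": JUDGE_MODEL,
            "prompt": JUDGE_PROMPT.format(question=question, answer=answer),
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return int(resp.json()["response"].strip())


if __name__ == "__main__":
    # Judge scores have known failure modes (e.g., verbosity bias), so in
    # practice they are spot-checked against human rubric scoring.
    score = judge("What year was the Eiffel Tower completed?",
                  "It was completed in 1889.")
    print("judge score:", score)
```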
📝 Abstract
Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow (Define, Test, Diagnose, Fix) that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic "improved" prompt template can trade off behaviors: on our small structured suites, extraction pass rate decreased from 100% to 90% and RAG compliance from 93.3% to 80% for Llama 3 when replacing task-specific prompts with generic rules, while instruction following improved. These findings motivate evaluation-driven prompt iteration and careful claim calibration rather than universal prompt recipes. All test suites, harnesses, and results are included for reproducibility.
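As an illustration of the kind of automated check such a harness might run, the sketch below scores a tiny structured-extraction suite against a local Ollama model and reports a pass rate. The test case, prompt, and model tag are hypothetical stand-ins, not the suites released with the paper.

```python
"""Minimal automated-check sketch in the spirit of a Define-Test-Diagnose-Fix
loop, assuming a local Ollama server at its default endpoint."""
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"  # assumed tag, e.g. pulled via `ollama pull llama3`

# Hypothetical structured-extraction cases: each expects a JSON object whose
# fields match the values stated in the input text.
TEST_CASES = [
    {
        "prompt": ("Extract the name and year as JSON with keys 'name' and 'year': "
                   "'The Eiffel Tower was completed in 1889.' Respond with JSON only."),
        "expected": {"name": "Eiffel Tower", "year": "1889"},
    },
]


def generate(prompt: str) -> str:
    """One non-streaming completion from the local model."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]


def check_case(case: dict) -> bool:
    """Automated check: the output must parse as JSON and match every expected field.
    A real harness would also normalize output first (e.g., strip code fences)."""
    try:
        parsed = json.loads(generate(case["prompt"]))
    except (json.JSONDecodeError, TypeError):
        return False
    return all(str(parsed.get(k)) == str(v) for k, v in case["expected"].items())


if __name__ == "__main__":
    results = [check_case(c) for c in TEST_CASES]
    passed = sum(results)
    print(f"pass rate: {passed}/{len(results)} ({100 * passed / len(results):.1f}%)")
```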