🤖 AI Summary
Users face the “blank-page problem” in LLM evaluation: difficulty getting started on customized, rigorous assessment workflows. Method: The paper introduces ChainBuddy, an AI workflow-generation assistant for behavioral LLM evaluation built into the ChainForge platform. From a single prompt or chat, it synthesizes an executable, interpretable evaluation pipeline aligned with the user's requirements, combining prompt engineering, workflow orchestration, and human-AI collaborative evaluation. The authors also surface a mismatch between users' subjective confidence and the objective quality of their work, and derive design principles for mitigating over-reliance. Contribution/Results: A within-subjects user study shows that AI assistance significantly reduces cognitive workload, increases user confidence, and improves pipeline quality; blind expert ratings of participant workflows improved by 37% on average, supporting the method's practical utility in real-world LLM assessment.
📝 Abstract
As large language models (LLMs) advance, their potential applications have grown significantly. However, it remains difficult to evaluate LLM behavior on user-defined tasks and to craft effective pipelines for doing so. Many users struggle with where to start, often referred to as the "blank page problem." ChainBuddy, an AI workflow generation assistant built into the ChainForge platform, aims to tackle this issue. From a single prompt or chat, ChainBuddy generates a starter evaluative LLM pipeline in ChainForge aligned with the user's requirements. ChainBuddy offers a straightforward, user-friendly way to plan and evaluate LLM behavior, making the process less daunting and more accessible across a wide range of tasks and use cases. We report a within-subjects user study comparing ChainBuddy to the baseline interface. We find that with AI assistance, participants reported a less demanding workload, felt more confident, and produced higher-quality pipelines for evaluating LLM behavior. However, we also uncover a mismatch between subjective and objective ratings of performance: participants rated their success similarly across conditions, while independent experts rated participant workflows significantly higher with AI assistance. Drawing connections to the Dunning-Kruger effect, we derive design implications for future workflow generation assistants to mitigate the risk of over-reliance.