🤖 AI Summary
Users often over-rely on large language models (LLMs) for simple tasks (e.g., arithmetic) because strong performance on complex ones (e.g., poetry generation) leads them to misjudge LLM reliability. Existing methods, which cluster instance embeddings to identify LLM failure modes and teach them to users, have shown limited effectiveness.
Method: We conduct the first empirical validation of whether systematic LLM failure patterns can be grouped and taught. We introduce a new measure of instructional efficacy, a user's accuracy in *anticipating* LLM errors, in place of the traditional human-AI team accuracy metric. Using meta-label grouping, embedding clustering, prompt engineering, and controlled user studies, we evaluate current automated failure-discovery and instruction approaches.
Contribution/Results: We find that state-of-the-art automatic failure discovery lacks stability. Critically, teaching evaluated under our new paradigm significantly improves users' error-anticipation accuracy (p < 0.01), providing both theoretical grounding and practical pathways toward reliable human-LLM collaboration.
📝 Abstract
People use large language models (LLMs) when they should not. This is partly because they see LLMs compose poems and answer intricate questions, so they understandably, but incorrectly, assume LLMs won't stumble on basic tasks like simple arithmetic. Prior work has tried to address this by clustering instance embeddings into regions where an LLM is likely to fail and automatically describing the patterns in these regions. The discovered failure patterns are then taught to users to mitigate their overreliance. Yet this approach has not fully succeeded. In this analysis paper, we aim to understand why.
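The embedding-clustering pipeline from prior work can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the 2-D synthetic "embeddings", the correctness labels, the minimal k-means, and the flagging rule (error rate more than twice the overall rate) stand in for whatever encoder, clusterer, and thresholds an actual system would use.

```python
import numpy as np

# Synthetic stand-in data (assumption): 2-D "embeddings" with one region
# where a hypothetical LLM succeeds and one where it fails.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0.0, 0.5, size=(80, 2)),  # instances the LLM answers correctly
    rng.normal(3.0, 0.5, size=(20, 2)),  # instances the LLM gets wrong
])
correct = np.array([1] * 80 + [0] * 20)

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means; returns a cluster label per row of X."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(embeddings, k=2)

# Flag clusters whose error rate is far above the overall error rate;
# these are the candidate failure regions a system would then describe
# in natural language and teach to users.
overall_err = 1 - correct.mean()
error_rates = [1 - correct[labels == j].mean() for j in range(2)]
flagged = [j for j, e in enumerate(error_rates) if e > 2 * overall_err]
for j in flagged:
    print(f"cluster {j}: size={(labels == j).sum()}, error rate={error_rates[j]:.2f}")
```

On this toy data the high-error region separates cleanly; the open question the paper examines is whether such regions are stable and describable on real instances.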
We first examine whether the negative result stems from the absence of failure patterns. We group instances in two datasets by their meta-labels and evaluate an LLM's predictions on these groups. We then define criteria to flag groups that are sizable and on which the LLM is error-prone, and find meta-label groups that meet these criteria. Their meta-labels are LLM failure patterns that could be taught to users, so such patterns do exist. We next test whether prompting and embedding-based approaches can surface these known failures. Without this, users cannot be taught about them to reduce their overreliance. We find mixed results across methods, which could explain the negative result. Finally, we revisit the metric used to measure teaching effectiveness. We propose to assess a user's ability to use the given failure patterns to anticipate when an LLM is error-prone. A user study shows a positive effect from teaching under this metric, unlike under human-AI team accuracy. Our findings show that teaching failure patterns could be a viable approach to mitigating overreliance, but success depends on better automated failure-discovery methods and on metrics like ours.
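The meta-label grouping and flagging step described above can be sketched in a few lines. The records, meta-label names, and both thresholds (minimum group size, maximum accuracy) are illustrative assumptions, not the paper's actual criteria.

```python
from collections import defaultdict

# Hypothetical records (assumption): (meta_label, llm_was_correct) per instance.
records = [
    ("negation", False), ("negation", False), ("negation", True),
    ("negation", False), ("negation", False),
    ("temporal", True), ("temporal", True), ("temporal", False),
    ("arithmetic", True), ("arithmetic", True), ("arithmetic", True),
]

MIN_SIZE = 5        # group must be sizable (threshold is an assumption)
MAX_ACCURACY = 0.5  # LLM must be error-prone on the group (assumption)

# Group instances by meta-label.
groups = defaultdict(list)
for meta_label, ok in records:
    groups[meta_label].append(ok)

# Keep groups that satisfy both criteria; their meta-labels are the
# failure patterns that could be taught to users.
failure_patterns = {
    label: sum(oks) / len(oks)
    for label, oks in groups.items()
    if len(oks) >= MIN_SIZE and sum(oks) / len(oks) <= MAX_ACCURACY
}
print(failure_patterns)
```

Here "temporal" is excluded for being too small and "arithmetic" for high accuracy, leaving "negation" as the one teachable failure pattern in this toy setup.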