Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from overgeneralization and factual hallucination when exploring large-scale domain-specific corpora without supervision, a problem exacerbated by context-window limitations — the "haystack description dilemma." This paper systematically identifies and characterizes this phenomenon. The authors propose a human-in-the-loop generative paradigm in which lightweight human feedback calibrates LLM outputs during topic discovery. They conduct human evaluations and knowledge-acquisition experiments on real-world domain corpora, benchmarking unsupervised and supervised LLM-based topic modeling against classical LDA. Results show that purely LLM-generated topics are highly readable but severely overgeneralized; integrating minimal human feedback substantially improves exploration fidelity and domain relevance; and while LDA remains robust, it lacks interactivity and adaptability. The work positions human-AI collaboration as a critical pathway for deploying LLMs in domain-document exploration, advancing trustworthy, interpretable, and human-centered AI-assisted analysis.

📝 Abstract
A common use of NLP is to facilitate the understanding of large document collections, with a shift from traditional topic models to Large Language Models. Yet the effectiveness of using LLMs for large-corpus understanding in real-world applications remains under-explored. This study measures the knowledge users acquire with unsupervised or supervised LLM-based exploratory approaches, or with traditional topic models, on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to the LLM generation process improves data exploration by mitigating hallucination and over-genericity but requires greater human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. We show that LLMs struggle to describe the haystack of large corpora without human help, particularly domain-specific data, and face scaling and hallucination limitations due to context length constraints. Dataset available at https://huggingface.co/datasets/zli12321/Bills.
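The context-length constraint highlighted in the abstract means any LLM-based corpus exploration must process documents in batches rather than all at once. A minimal sketch of greedy batching (pure Python; `chunk_corpus` and the whitespace token count are illustrative assumptions, not the paper's code):

```python
def chunk_corpus(docs, max_tokens=2048, tokens=lambda d: len(d.split())):
    """Greedily pack documents into batches that fit a context window.

    A document longer than max_tokens still gets its own batch;
    truncation or summarization would be a separate step.
    """
    batches, current, used = [], [], 0
    for doc in docs:
        n = tokens(doc)
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        batches.append(current)
    return batches
```

Because each batch is summarized independently, topic labels must later be merged across batches — one source of the overgeneralization the paper measures.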
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with domain-specific corpus understanding
Human supervision improves LLM-generated topic specificity
Traditional models effective but less user-friendly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-in-the-loop LLM evaluation
Mitigating hallucination with supervision
Comparing LLM and traditional topic models
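The abstract reports "average win probabilities" from human preference judgments between systems. As an illustration only (a hypothetical helper, not the authors' evaluation code), a pairwise win probability can be estimated from vote counts, with ties split evenly:

```python
def win_probability(wins_a, wins_b, ties=0):
    """Estimate P(system A preferred over system B) from pairwise
    human judgments, counting each tie as half a win for A."""
    total = wins_a + wins_b + ties
    if total == 0:
        raise ValueError("no judgments to aggregate")
    return (wins_a + 0.5 * ties) / total
```

Averaging such probabilities over many topic pairs gives a single comparison score per system pair.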
Zongxia Li
University of Maryland, College Park
Natural Language Processing · Multimodal Models
Lorena Calvo-Bartolomé
Universidad Carlos III of Madrid, Spain
A. Hoyle
University of Maryland, College Park
Paiheng Xu
University of Maryland, College Park
Computational Social Science · Natural Language Processing · AI for Education
A. Dima
J. Fung
Jordan L. Boyd-Graber
University of Maryland, College Park