🤖 AI Summary
Addressing the challenges of diverse error types, reliance on manual annotations, and the need for contextual understanding in tabular-data error detection, this paper proposes ZeroED, a zero-shot hybrid framework. Methodologically, ZeroED (1) builds rich data representations that combine error-reason-aware binary features, pre-trained embeddings, and statistical features; (2) employs K-means clustering-based sampling and intra-cluster label propagation to construct training data without supervision; and (3) designs a two-step LLM-based contextual reasoning mechanism for error labeling, augmented with LLM verification to improve reliability. Crucially, ZeroED operates without any human-labeled training data. Evaluated on seven public benchmarks, it achieves up to a 30% improvement in F1 score over prior state-of-the-art methods while reducing token consumption by up to 90%, demonstrating significant gains in both accuracy and efficiency.
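To make step (1) concrete, here is a minimal sketch of how a cell representation of this kind might be assembled; all function names and the choice of statistical features are illustrative assumptions, not the paper's actual API:

```python
# Hypothetical sketch of ZeroED-style feature construction for one table cell:
# concatenate (1) binary indicators for candidate error reasons, (2) a
# pre-trained text embedding, and (3) simple statistical features.
import numpy as np

def cell_features(value, column_values, reason_checks, embed):
    """Names and feature choices here are illustrative, not from the paper."""
    # (1) error-reason-aware binary features, e.g. "is empty?", "contains digits?"
    binary = np.array([float(check(value)) for check in reason_checks])
    # (2) pre-trained embedding of the cell text (embed is a stand-in callable)
    emb = np.asarray(embed(value), dtype=float)
    # (3) statistical features: value frequency within its column, value length
    freq = column_values.count(value) / len(column_values)
    stats = np.array([freq, float(len(str(value)))])
    return np.concatenate([binary, emb, stats])
```

In practice `embed` would wrap a pre-trained language model and `reason_checks` would encode the LLM-derived error reasons; the point is only that the three feature families are concatenated into one vector per cell.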
📝 Abstract
Error detection (ED) in tabular data is crucial yet challenging due to diverse error types and the need for contextual understanding. Traditional ED methods often rely heavily on manual detection criteria and labels, making them labor-intensive. Large language models (LLMs) can minimize human effort but struggle with errors that require a comprehensive understanding of the data context. In this paper, we propose ZeroED, a novel hybrid zero-shot error detection framework that combines LLM reasoning with the manual-label-based ED pipeline. ZeroED operates in four steps: feature representation, error labeling, training data construction, and detector training. First, to sharpen the distinction between correct and erroneous values, ZeroED generates rich data representations using error-reason-aware binary features, pre-trained embeddings, and statistical features. ZeroED then uses an LLM to label errors holistically via in-context learning, guided by a two-step reasoning process that produces detailed error detection guidelines. To reduce token costs, the LLM is applied only to representative data selected through clustering-based sampling. High-quality training data is then constructed through in-cluster label propagation and LLM augmentation with verification. Finally, a classifier is trained to detect errors across the entire dataset. Extensive experiments on seven public datasets demonstrate that ZeroED substantially outperforms state-of-the-art methods, with up to a 30% improvement in F1 score and up to 90% reduction in token cost.
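The clustering-based sampling and in-cluster label propagation described above can be sketched as follows; this is a simplified illustration under assumed details (a plain NumPy K-means, one labeled representative per cluster), not the paper's implementation:

```python
# Hypothetical sketch: cluster cell features with K-means, send only the
# representative nearest each centroid to an expensive labeler (the LLM in
# ZeroED), then propagate that label to the rest of the cluster.
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's-algorithm K-means over rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (assign == c).any():
                centers[c] = X[assign == c].mean(axis=0)
    return centers, assign

def sample_and_propagate(X, k, labeler):
    """Label one representative per cluster, then copy its label cluster-wide."""
    centers, assign = kmeans(X, k)
    labels = np.empty(len(X), dtype=int)
    for c in range(k):
        members = np.where(assign == c)[0]
        # representative = member nearest the centroid; only it is "labeled"
        rep = members[np.argmin(np.linalg.norm(X[members] - centers[c], axis=1))]
        labels[members] = labeler(rep)  # in-cluster label propagation
    return labels
```

With k clusters, the costly labeler is invoked only k times instead of once per cell, which is the source of the token savings; verification of the propagated labels (as ZeroED adds) would sit on top of this loop.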