🤖 AI Summary
Supervised learning depends on high-quality labeled data, which is expensive and slow to obtain through human annotation, and labels generated by current large language models (LLMs) still fall short of human quality. This paper proposes ACT (Annotation with Critical Thinking), a data pipeline in which LLMs act both as annotators and as judges that critically flag potential errors in the machine-generated labels. Human reviewers then check only the most suspicious samples, which yields an efficient human-in-the-loop annotation process. Key contributions include: (1) applicability across natural language processing, computer vision, and multimodal understanding through multimodal LLMs; (2) seven empirically derived insights on improving annotation quality while reducing human cost, translated into user-friendly guidelines; and (3) a theoretical analysis of how to modify the loss function so that models trained on ACT data perform similarly to models trained on fully human-annotated data. Experiments show that the performance gap to fully human-labeled baselines can be reduced to less than 2% on most benchmark datasets while saving up to 90% of human annotation cost.
📝 Abstract
Supervised learning relies on high-quality labeled data, but obtaining such data through human annotation is both expensive and time-consuming. Recent work explores using large language models (LLMs) for annotation, but LLM-generated labels still fall short of human-level quality. To address this problem, we propose the Annotation with Critical Thinking (ACT) data pipeline, where LLMs serve not only as annotators but also as judges to critically identify potential errors. Human effort is then directed towards reviewing only the most "suspicious" cases, significantly improving human annotation efficiency. Our major contributions are as follows: (1) ACT is applicable to a wide range of domains, including natural language processing (NLP), computer vision (CV), and multimodal understanding, by leveraging multimodal LLMs (MLLMs). (2) Through empirical studies, we derive 7 insights on how to enhance annotation quality while efficiently reducing the human cost, and then translate these findings into user-friendly guidelines. (3) We theoretically analyze how to modify the loss function so that models trained on ACT data achieve similar performance to those trained on fully human-annotated data. Our experiments show that the performance gap can be reduced to less than 2% on most benchmark datasets while saving up to 90% of human costs.
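The annotate, judge, and selectively review flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `triage_for_review`, the callables `annotate`, `judge`, and `human_review`, the 0-to-1 suspicion score, and the `budget_fraction` parameter are all hypothetical names introduced here. The abstract does not specify how the judge scores suspicion, and the loss-function modification is not covered.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class AnnotatedSample:
    item: Any
    label: Any
    suspicion: float  # hypothetical judge score: higher means "more likely wrong"
    human_reviewed: bool = False


def triage_for_review(
    items: Sequence[Any],
    annotate: Callable[[Any], Any],      # stand-in for the LLM annotator
    judge: Callable[[Any, Any], float],  # stand-in for the LLM judge
    human_review: Callable[[Any], Any],  # stand-in for a human annotator
    budget_fraction: float = 0.1,        # assumed share of samples humans will check
) -> List[AnnotatedSample]:
    """Label every item by machine, then send only the most suspicious to humans."""
    if not 0.0 <= budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be between 0 and 1")

    # Step 1: the annotator labels every item.
    samples = [AnnotatedSample(x, annotate(x), 0.0) for x in items]

    # Step 2: the judge scores how suspicious each machine label looks.
    for s in samples:
        s.suspicion = judge(s.item, s.label)

    # Step 3: humans re-label only the top-scoring samples, within the budget
    # (the budget is truncated toward zero to a whole number of samples).
    budget = int(len(samples) * budget_fraction)
    ranked = sorted(samples, key=lambda s: s.suspicion, reverse=True)
    for s in ranked[:budget]:
        s.label = human_review(s.item)
        s.human_reviewed = True

    return samples  # original order is preserved
```

With `budget_fraction=0.1`, humans see one sample in ten, which corresponds to the "up to 90%" saving in human cost that the abstract reports. In practice the three callables would wrap model calls and a review interface; here they are left as plain functions so the control flow is easy to follow.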