🤖 AI Summary
This study investigates the feasibility of deploying large language models (LLMs) to replace human experts in high-stakes, domain-specific text annotation tasks—specifically in finance, biomedicine, and law—where accuracy and domain fidelity are critical.
Method: We propose a multi-agent deliberation framework that emulates expert consensus-building, integrating chain-of-thought prompting, self-consistency reasoning, and advanced reasoning models (e.g., o3-mini) for collaborative annotation.
Contribution/Results: To our knowledge, this is the first empirical, cross-domain comparative analysis of LLMs’ annotation capabilities in specialized domains. Results show that inference-time reasoning enhancements yield only marginal gains, that reasoning-augmented models do not significantly outperform non-reasoning baselines, and that several models exhibit judgment rigidity during multi-agent interaction. Collectively, these findings provide critical empirical evidence delineating the reliability boundaries of LLMs in high-assurance professional annotation tasks.
📝 Abstract
Textual data annotation, the process of labeling or tagging text with relevant information, is typically costly, time-consuming, and labor-intensive. While large language models (LLMs) have demonstrated their potential as direct alternatives to human annotators for general-domain natural language processing (NLP) tasks, their effectiveness on annotation tasks in domains requiring expert knowledge remains underexplored. In this paper, we investigate whether top-performing LLMs, which might be perceived as having expert-level proficiency on academic and professional benchmarks, can serve as direct alternatives to human expert annotators. To this end, we evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Specifically, we propose a multi-agent discussion framework to simulate a group of human annotators, in which LLMs engage in discussion, considering the other agents' annotations and justifications before finalizing their own labels. Additionally, we incorporate reasoning models (e.g., o3-mini) to enable a more comprehensive comparison. Our empirical results reveal that: (1) individual LLMs equipped with inference-time techniques (e.g., chain-of-thought (CoT) prompting, self-consistency) show only marginal, and sometimes negative, performance gains, contrary to prior literature suggesting their broad effectiveness; (2) reasoning models do not demonstrate statistically significant improvements over non-reasoning models in most settings, suggesting that extended CoT provides relatively limited benefit for data annotation in specialized domains; and (3) distinctive model behaviors emerge in the multi-agent discussion environment. For instance, Claude 3.7 Sonnet with extended thinking rarely changes its initial annotations, even when other agents provide correct annotations or valid reasoning.