🤖 AI Summary
This study addresses the scarcity of high-quality annotated data for German aspect-based sentiment analysis (ABSA) and the unclear impact of annotation sources on model performance. It presents the first systematic comparison of annotation quality among experts, students, crowdworkers, and large language models (LLMs) in the German ABSA context. The authors construct a gold-standard dataset through expert re-annotation and evaluate the effectiveness of each annotation type on two core tasks: aspect category sentiment analysis (ACSA) and target aspect sentiment detection (TASD). Using state-of-the-art BERT-, T5-, and LLaMA-based models with both fine-tuning and instruction-based prompting, the experiments demonstrate that expert annotations yield significantly higher consistency and downstream task performance. The study also quantifies the trade-offs of using LLM-generated and non-expert annotations under resource-constrained conditions, highlighting their practical feasibility alongside inherent limitations.
📝 Abstract
Aspect-Based Sentiment Analysis (ABSA) enables fine-grained opinion analysis by identifying sentiments toward specific aspects or targets within a text. While ABSA has been widely studied for English, research on other languages such as German remains limited, largely due to the lack of high-quality annotated datasets. This paper examines how different annotation sources influence the development of German ABSA. To this end, an existing dataset is re-annotated by experts to establish a ground truth, which serves as a reference for evaluating annotations produced by students, crowdworkers, Large Language Models (LLMs), and experts. Annotation quality is compared in terms of Inter-Annotator Agreement (IAA) and its impact on downstream model performance across different ABSA subtasks. The evaluation focuses on Aspect Category Sentiment Analysis (ACSA) and Target Aspect Sentiment Detection (TASD). To assess performance differences, we apply State-of-the-Art (SOTA) methods for ABSA, including BERT-, T5-, and LLaMA-based approaches, spanning fine-tuning and in-context learning with instruction prompts. The findings provide practical insights into trade-offs between annotation reliability and efficiency, offering guidance for dataset construction in under-resourced Natural Language Processing (NLP) scenarios.
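As an illustration of how agreement between two annotation sources might be quantified, the following is a minimal sketch of Cohen's kappa for sentiment labels. The abstract does not state which IAA coefficient the authors use, so the choice of Cohen's kappa is an assumption, and the function name `cohen_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the abstract only mentions "IAA"; this coefficient is
    one common choice, not necessarily the one used in the paper.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")

    n = len(labels_a)

    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, multiply the two annotators'
    # marginal frequencies, then sum over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # Degenerate case: both annotators used one and the same label throughout.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six aspect mentions (not data from the paper).
expert = ["positive", "negative", "neutral", "positive", "negative", "positive"]
crowd = ["positive", "negative", "positive", "positive", "neutral", "positive"]

kappa = cohen_kappa(expert, crowd)
print(f"Cohen's kappa: {kappa:.3f}")
```

Here the two sources agree on four of six items, but because "positive" dominates both label distributions, much of that agreement is expected by chance, and kappa comes out noticeably lower than the raw agreement rate.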