LLMs for automatic annotation of Mandarin narrative transcripts

📅 2026-05-16

🤖 AI Summary

This study addresses the high cost of manually annotating macro-level story grammar structures in spoken Chinese narratives and the lack of efficient automated solutions for non-English languages. It presents the first systematic evaluation of large language models (LLMs) for automatic story grammar annotation in Chinese, using the MAIN assessment framework and structured prompt engineering. Four LLMs, deployed locally and via cloud APIs, were evaluated against human annotations. The best-performing model achieved a Cohen's κ of 0.794, approaching inter-annotator reliability (κ = 0.872), while reducing annotation time by 65%. The work also releases open-source prompt templates and shows that semantic complexity and discourse variability affect model performance, offering a novel paradigm for analyzing complex spoken discourse in non-English languages.

📝 Abstract

Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown promise in automating annotation tasks, their ability to handle complex discourse-level annotation in non-English languages remains understudied. This study evaluates whether LLMs can reliably annotate narrative macrostructure (the hierarchical organization of story grammar elements) in spoken Mandarin, using the Multilingual Assessment Instrument for Narratives (MAIN) as a testbed. We compared four LLMs against trained human annotators on narratives produced by children, young adults, and older adults. The best-performing model achieved agreement with human raters (κ = .794) approaching human-human reliability (κ = .872) while reducing annotation time by 65%, whereas the locally deployable lightweight model performed substantially worse. Annotation difficulty varied systematically by macrostructure element type, with categories requiring subtle semantic differentiation posing persistent challenges. Furthermore, model reliability decreased on young adult narratives, which exhibited greater lexical variation, semantic ambiguity, and multi-element integration within single utterances. These findings suggest that LLMs can effectively support discourse-level annotation in non-English spoken corpora, while highlighting the continued need for human oversight in semantically complex tasks. Our prompt templates are open-sourced for future use.
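
The agreement figures above are Cohen's κ values, which correct raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The following is a minimal sketch of that calculation, not the authors' evaluation code. The function name `cohens_kappa`, the label names, and the six-utterance sample are illustrative assumptions and do not come from the study's data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # p_o: proportion of items on which the two raters assign the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # p_e: chance agreement, from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six utterances (assumed, not from the paper).
human = ["goal", "attempt", "outcome", "goal", "none", "attempt"]
model = ["goal", "attempt", "outcome", "attempt", "none", "attempt"]
kappa = cohens_kappa(human, model)
```

In this toy sample the raters agree on five of six items (p_o = 5/6), chance agreement is p_e = 10/36, and κ = 20/26 ≈ 0.77, noticeably lower than the raw agreement of about 0.83.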

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

automatic annotation

Mandarin narratives

discourse-level annotation

story grammar

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models

discourse-level annotation

Mandarin narratives