🤖 AI Summary
This study addresses the longstanding absence of structured corpora for improvised poetry in the Logudorese variety of Sardinian, a gap that has hindered both computational linguistic analysis and the preservation of oral traditions. The authors present A Bolu, the first structured corpus of this poetic form, comprising 2,835 stanzas and 141,321 tokens. Combining descriptive statistics with computational linguistic methods, they carry out a multidimensional textual analysis and find recurring formulaic patterns in this oral poetry. These findings support the Parry-Lord theory of oral-formulaic composition, contribute to more inclusive NLP resources for minority languages, and open new avenues for understanding oral creativity.
📝 Abstract
The growing interest of Natural Language Processing (NLP) in minority languages has not yet bridged the gap in the preservation of oral linguistic heritage. In particular, extemporaneous poetry, a performative genre based on real-time improvisation and metrical-rhetorical competence, remains largely unexplored in computational linguistics. Closing this gap requires dedicated resources for documenting and analysing the structures of improvised poetry. A Bolu was created in this context: it is the first structured corpus of extemporaneous poetry dedicated to the cantada logudorese, a poetic tradition in the Logudorese variety of Sardinian. The dataset comprises 2,835 stanzas for a total of 141,321 tokens. The study presents the architecture of the corpus and applies a multidimensional analysis, combining descriptive statistical indices with computational linguistics techniques, to map the characteristics of the poetic text. The results indicate that the production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord's theory of formulaicity. This evidence provides a new key to understanding oral creativity and contributes to the development of NLP tools that are more inclusive and more sensitive to the specificities of less widely spoken languages.