so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs

📅 2025-10-19

📈 Citations: 0

✨ Influential: 0

career value

171K/year

🤖 AI Summary

Whitespace in poetry serves as a critical formal and semantic carrier, yet it remains systematically neglected in NLP modeling. Method: Leveraging a corpus of 19,000 English poems, we conduct a diachronic, genre- and source-aware analysis of whitespace distribution and quantify the representational distortion introduced by standard preprocessing (e.g., tokenization, whitespace removal). We further benchmark whitespace patterns across LLM-generated poems, online community poems, and published poetry. Results: Standard formatting-agnostic pretraining introduces significant representational bias; LLM outputs and crowd-sourced poems exhibit statistically divergent whitespace structures compared to published poetry—revealing fundamental deficits in models’ acquisition of poetic spatial syntax. To address this, we propose a “format-preserving” preprocessing principle and release PoemSpace, the first publicly available, manually verified English poetry dataset (2,800 poems) retaining original formatting. This work establishes foundational methodology and data infrastructure for computational poetry research and LLMs’ literary capability development.

Technology Category

Application Category

📝 Abstract

Whitespace is a critical component of poetic form, reflecting both adherence to standardized forms and rebellion against those forms. Each poem's whitespace distribution reflects the artistic choices of the poet and is an integral semantic and spatial feature of the poem. Yet, despite the popularity of poetry as both a long-standing art form and as a generation task for large language models (LLMs), whitespace has not received sufficient attention from the NLP community. Using a corpus of 19k English-language published poems from Poetry Foundation, we investigate how 4k poets have used whitespace in their works. We release a subset of 2.8k public-domain poems with preserved formatting to facilitate further research in this area. We compare whitespace usage in the published poems to (1) 51k LLM-generated poems, and (2) 12k unpublished poems posted in an online community. We also explore whitespace usage across time periods, poetic forms, and data sources. Additionally, we find that different text processing methods can result in significantly different representations of whitespace in poetry data, motivating us to use these poems and whitespace patterns to discuss implications for the processing strategies used to assemble pretraining datasets for LLMs.

Problem

Research questions and friction points this paper is trying to address.

Analyzing whitespace usage in human versus LLM-generated poetry

Investigating whitespace variations across poetic forms and time periods

Evaluating text processing methods for preserving poetic whitespace integrity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes whitespace patterns in 19k published poems

Compares human and LLM-generated poetry whitespace usage

Evaluates whitespace representation in LLM pretraining datasets

🔎 Similar Papers

Word Boundary Information Isn’t Useful for Encoder Language Models