ScholaWrite: A Dataset of End-to-End Scholarly Writing Process

📅 2025-02-05
🤖 AI Summary
This work addresses end-to-end modeling of the cognitive processes behind academic writing. We introduce the first longitudinal, real-world academic writing dataset, captured over a four-month period within authentic LaTeX environments. It comprises keystroke-level logs (nearly 62,000 text changes) and fine-grained, expert-annotated cognitive intents (e.g., ideation, revision, citation, formatting) across five preprints. Moving beyond static analysis of the final document, our approach characterizes the writer's cognitive state at each writing step. Methodologically, we combine keystroke logging, iterative multi-round annotation of cognitive intent, and privacy-preserving data curation. Experiments indicate that process-oriented data of this kind, rather than the final manuscript alone, is important for AI writing assistants on tasks such as iterative editing and intent-aware assistance. The de-identified dataset, interactive demo, and implementation code are publicly released as infrastructure for future adaptive academic writing tools.

📝 Abstract
Writing is a cognitively demanding task involving continuous decision-making, heavy use of working memory, and frequent switching between multiple activities. Scholarly writing is particularly complex, as it requires authors to coordinate many pieces of multiform knowledge. To understand writers' cognitive thought processes, one should decode the end-to-end writing data (from individual ideas to final manuscript) and the complex cognitive mechanisms involved in scholarly writing. We introduce the ScholaWrite dataset, first-of-its-kind keystroke logs of an end-to-end scholarly writing process for complete manuscripts, with thorough annotations of the cognitive writing intention behind each keystroke. Our dataset includes LaTeX-based keystroke data from five preprints, with nearly 62K total text changes and annotations across 4 months of paper writing. ScholaWrite shows promising usability and applications (e.g., iterative self-writing) for the future development of AI writing assistants for academic research, which necessitate complex methods beyond LLM prompting. Our experiments demonstrate the importance of collecting end-to-end writing data, rather than only the final manuscript, for developing future writing assistants that support the cognitive thinking process of scientists. Our de-identified dataset, demo, and code repository are available on our project page.
Problem

Research questions and friction points this paper is trying to address.

Decode end-to-end scholarly writing process data
Understand cognitive mechanisms in academic writing
Develop AI assistants for scientific cognitive thinking
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end keystroke logs
Cognitive writing intentions annotations
LaTeX-based scholarly writing dataset
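
To make the idea of an intent-annotated keystroke log concrete, here is a minimal Python sketch of what one logged text change might look like and how it could be inspected. Everything in it is an assumption for illustration: the `TextChange` record, its field names, the sample log, and the helper functions are hypothetical and are not the published ScholaWrite schema; the intent labels are simply borrowed from the examples in the summary above.

```python
from collections import Counter
from dataclasses import dataclass
import difflib


@dataclass
class TextChange:
    """One logged edit. All field names are hypothetical, not the dataset's schema."""
    timestamp: float  # seconds since the start of the session (assumed unit)
    before: str       # text snapshot before the edit
    after: str        # text snapshot after the edit
    intent: str       # annotated writing intention, e.g. "ideation"


def changed_spans(change):
    """Return (deleted_text, inserted_text) between the two snapshots."""
    matcher = difflib.SequenceMatcher(None, change.before, change.after, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.append(change.before[i1:i2])
        if tag in ("insert", "replace"):
            inserted.append(change.after[j1:j2])
    return "".join(deleted), "".join(inserted)


def intent_counts(changes):
    """Count how often each annotated intention appears in a log."""
    return Counter(change.intent for change in changes)


# A tiny made-up log; real records would come from the released dataset.
log = [
    TextChange(0.0, "", "We propose", "ideation"),
    TextChange(4.2, "We propose", "We propose a dataset", "ideation"),
    TextChange(9.7, "We propose a dataset", "We introduce a dataset", "revision"),
]

if __name__ == "__main__":
    for change in log:
        print(change.intent, changed_spans(change))
    print(intent_counts(log))
```

Storing before and after snapshots keeps each record self-describing, at the cost of redundancy; a real pipeline might store only the diff and rebuild snapshots by replaying the log in timestamp order.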