Addressing the Ecological Fallacy in Larger LMs with Human Context

πŸ“… 2026-03-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the ecological fallacy that arises when large language models (LLMs) neglect linguistic dependencies across multiple texts written by the same author during training and inference. To mitigate this, the authors extend the Human Language Modeling (HuLM) task, which models language in its author context, to an 8B-scale Llama model for the first time. By combining continual HuLM pretraining with HuLM-aware fine-tuning (HuFT), a strategy that incorporates author context during fine-tuning alone, the approach outperforms standard methods. The framework integrates QLoRA for parameter-efficient adaptation, temporal ordering of each author's text sequences, and a linear classifier head. Evaluated across eight downstream tasks, the method consistently improves performance and enhances cross-task generalization, demonstrating the efficacy of explicitly modeling author-level linguistic patterns in LLMs.
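The author-context idea described above can be illustrated with a short sketch. This is not the authors' code: the separator token name, the function signature, and the example texts are all hypothetical; the point is only that an author's texts are sorted by time and concatenated so the model conditions on the author's earlier writing.

```python
from typing import List, Tuple

# Hypothetical inter-document separator token (the actual token used by
# the HuLM models may differ).
SEP = "<|insep|>"

def build_author_context(texts: List[Tuple[str, str]]) -> str:
    """Sort an author's (timestamp, text) pairs chronologically and join
    them into one sequence, so each text appears in the context of the
    author's prior texts."""
    ordered = sorted(texts, key=lambda t: t[0])
    return SEP.join(text for _, text in ordered)

docs = [
    ("2021-05-02", "Second post by the author."),
    ("2020-11-30", "First post by the author."),
]
print(build_author_context(docs))
# -> "First post by the author.<|insep|>Second post by the author."
```

In practice the resulting sequence would be tokenized and truncated to the model's context window, with the temporal order preserved.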

Technology Category

Application Category

πŸ“ Abstract
Language model training and inference ignore a fundamental linguistic fact -- there is a dependence between multiple sequences of text written by the same person. Prior work has shown that addressing this form of ecological fallacy can greatly improve the performance of multiple smaller (~124M) GPT-based models. In this work, we ask if addressing the ecological fallacy by modeling the author's language context with a specific LM task (called HuLM) can provide similar benefits for a larger-scale model, an 8B Llama model. To this end, we explore variants that process an author's language in the context of their other temporally ordered texts. We study the effect of pre-training with this author context using the HuLM objective, as well as using it during fine-tuning with author context (HuFT: Human-aware Fine-Tuning). Empirical comparisons show that addressing the ecological fallacy during fine-tuning alone using QLoRA improves the performance of the larger 8B model over standard fine-tuning. Additionally, QLoRA-based continued HuLM pre-training results in a human-aware model generalizable for improved performance over eight downstream tasks with linear task classifier training alone. These results indicate the utility and importance of modeling language in the context of its original generators, the authors.
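The abstract's evaluation protocol — training only a linear task classifier on top of a frozen, human-aware model — can be sketched as a simple linear probe. This is an illustrative toy, not the paper's pipeline: the features here are random vectors standing in for frozen model representations, and the training loop is plain logistic regression by gradient descent.

```python
import math
import random

def train_linear_probe(X, y, lr=0.5, steps=500):
    """Logistic-regression probe: the only trained parameters are a
    weight vector and a bias, mirroring linear-classifier-only training
    over frozen features."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    n = len(y)
    for _ in range(steps):
        gw = [0.0] * dim
        gb = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = p - yi                     # gradient of the log-loss
            for j in range(dim):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gwj / n for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

# Toy "frozen features": two well-separated Gaussian clusters.
random.seed(0)
X = [[random.gauss(-1, 0.3) for _ in range(4)] for _ in range(20)]
X += [[random.gauss(1, 0.3) for _ in range(4)] for _ in range(20)]
y = [0] * 20 + [1] * 20

w, b = train_linear_probe(X, y)
acc = sum(
    int((sum(wj * xj for wj, xj in zip(w, x)) + b > 0) == bool(t))
    for x, t in zip(X, y)
) / len(y)
print(acc)  # near 1.0 on this separable toy set
```

The appeal of this protocol, as the abstract notes, is that a single continually pre-trained human-aware model can serve eight downstream tasks with only this cheap linear step per task.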
Problem

Research questions and friction points this paper is trying to address.

ecological fallacy
author context
language modeling
human-aware modeling
text dependence
Innovation

Methods, ideas, or system contributions that make the work stand out.

ecological fallacy
human-aware language modeling
author context
HuLM
HuFT
πŸ”Ž Similar Papers
No similar papers found.
Nikita Soni
PhD Student, Stony Brook University
Dhruv Vijay Kunjadiya
Department of Computer Science, Stony Brook University
Pratham Piyush Shah
Department of Computer Science, Stony Brook University
Dikshya Mohanty
Department of Computer Science, Stony Brook University
H. Andrew Schwartz
Computer Science & Psychology, Stony Brook University
natural language processing, human centered AI, computational psychology, health informatics
Niranjan Balasubramanian
Assistant Professor, Computer Science, Stony Brook University
Natural Language Processing