🤖 AI Summary
Investigative journalists routinely process vast document collections, yet existing RAG systems face deployment barriers in newsrooms due to hallucination, high verification overhead, and privacy risks. This paper proposes a five-stage, fully local retrieval framework tailored to investigative journalism. It enables end-to-end on-premises deployment using quantized small language models (Gemma 3 12B, Qwen 3 14B, GPT-OSS 20B) within 24 GB of memory. The framework preserves auditability via explicit citation chains and combines corpus summarization, parallel task scheduling, and multi-stage quality assessment to balance efficiency with editorial control. Its core contribution is a journalist-centric design that balances three goals under resource constraints: privacy preservation, human-in-the-loop intervention, and verifiable reliability. Experiments show high citation validity and feasibility on standard desktop hardware; however, errors can propagate through multi-stage synthesis and performance varies sharply with training-data overlap between model and corpus, necessitating human-AI collaborative validation.
📝 Abstract
Investigative journalists routinely confront large document collections. Large language models (LLMs) with retrieval-augmented generation (RAG) capabilities promise to accelerate document discovery, but newsroom adoption remains limited due to hallucination risks, verification burden, and data privacy concerns. We present a journalist-centered approach to LLM-powered document search that prioritizes transparency and editorial control through a five-stage pipeline -- corpus summarization, search planning, parallel thread execution, quality evaluation, and synthesis -- using small, locally deployable language models that preserve data security and maintain complete auditability through explicit citation chains. Evaluating three quantized models (Gemma 3 12B, Qwen 3 14B, and GPT-OSS 20B) on two corpora, we find substantial variation in reliability. All models achieved high citation validity and ran effectively on standard desktop hardware (e.g., 24 GB of memory), demonstrating feasibility for resource-constrained newsrooms. However, systematic challenges emerged, including error propagation through multi-stage synthesis and dramatic performance variation based on training-data overlap with corpus content. These findings suggest that effective newsroom AI deployment requires careful model selection and system design, alongside human oversight to maintain standards of accuracy and accountability.
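The five stages named in the abstract can be sketched as a simple orchestration skeleton. This is a minimal illustrative sketch, not the authors' implementation: every function, class, and field name here is a hypothetical placeholder, and the model calls are stubbed out with string manipulation so the control flow and the citation chain are visible end to end.

```python
from dataclasses import dataclass

# Hypothetical sketch of the five-stage pipeline; all names are illustrative.

@dataclass
class Citation:
    doc_id: str
    snippet: str

@dataclass
class Finding:
    claim: str
    citations: list  # explicit citation chain for auditability

def summarize_corpus(corpus):
    # Stage 1: corpus summarization -- give the planner a map of the documents.
    return {doc_id: text[:80] for doc_id, text in corpus.items()}

def plan_searches(question, summary):
    # Stage 2: search planning -- decompose the question into search threads.
    return [f"{question} :: {doc_id}" for doc_id in summary]

def run_thread(query, corpus):
    # Stage 3: thread execution (run sequentially here for clarity;
    # the paper describes parallel execution).
    doc_id = query.split(" :: ")[1]
    return Finding(claim=f"evidence from {doc_id}",
                   citations=[Citation(doc_id, corpus[doc_id][:40])])

def evaluate(findings):
    # Stage 4: quality evaluation -- keep only findings backed by citations.
    return [f for f in findings if f.citations]

def synthesize(findings):
    # Stage 5: synthesis -- merge findings while preserving the citation chain.
    return {"answer": "; ".join(f.claim for f in findings),
            "citations": [c.doc_id for f in findings for c in f.citations]}

def pipeline(question, corpus):
    summary = summarize_corpus(corpus)
    findings = [run_thread(q, corpus) for q in plan_searches(question, summary)]
    return synthesize(evaluate(findings))

corpus = {"memo-1": "Internal memo about offshore accounts...",
          "email-7": "Email thread discussing the contract..."}
result = pipeline("Who approved the contract?", corpus)
```

Keeping citations attached to every intermediate `Finding`, rather than only to the final answer, is what makes the output auditable: a journalist can trace any synthesized claim back to the exact document it came from.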