Retrieval-augmented reasoning with lean language models

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To simultaneously achieve lightweight model deployment, data privacy preservation, and robust explanatory capability for complex domain queries in resource-constrained, high-security settings, this paper proposes a retrieval-augmented reasoning (RAR) framework tailored to small language models. The method integrates dense retrieval with reasoning-aware fine-tuning, leveraging reasoning traces and synthetic queries generated by large models to construct high-quality training data, and incorporates summarisation-based document compression to improve training efficiency. Built on a locally deployed, lightweight fine-tuned Qwen2.5-Instruct model and a local dense retriever, the framework enables fully on-device RAR inference. Experiments on the NHS medical knowledge base demonstrate significant improvements in answer accuracy and consistency, approaching the performance of large models while outperforming existing lightweight baselines. The code and models are fully open-sourced, supporting cross-domain adaptation and reproducibility.

📝 Abstract
This technical report details a novel approach to combining reasoning and retrieval-augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval-augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.
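The pipeline described in the abstract follows the standard retrieve-then-generate pattern: embed the query, fetch the top-k most similar corpus documents, and condition the local model on them. Below is a minimal sketch of that loop, with a toy bag-of-words similarity standing in for the dense retriever and the fine-tuned Qwen2.5-Instruct model left out; the function names and miniature corpus are illustrative assumptions, not the paper's released code.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a dense encoder.
    return Counter(w.strip(".,?!").lower() for w in text.split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, docs):
    # Concatenate retrieved evidence ahead of the question, as in standard RAG;
    # the assembled prompt would then be passed to the local backbone model.
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Illustrative stand-ins for NHS A-to-Z condition pages.
corpus = [
    "Asthma is a condition that affects the airways and causes wheezing.",
    "Migraine is a moderate or severe headache felt on one side of the head.",
    "Chickenpox causes an itchy, spotty rash and is common in children.",
]

query = "What causes wheezing?"
prompt = build_prompt(query, retrieve(query, corpus, k=1))
print(prompt)
```

Because retrieval and generation both run locally, no query or document ever leaves the device, which is the privacy property the report emphasises.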
Problem

Research questions and friction points this paper is trying to address.

Combining reasoning and retrieval in lean language models
Addressing privacy and performance in resource-constrained environments
Improving accuracy in domain-specific queries with lightweight models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines reasoning and retrieval in lean model
Uses synthetic query generation and reasoning traces
Integrates dense retriever with fine-tuned models
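The three contributions above meet in the training-data construction step: a frontier model answers a synthetic query over a retrieved document, and its reasoning trace is kept as a supervision target for the small model. A hedged sketch of what one such chat-format training record could look like follows; the field names and `<think>` tags are assumptions (DeepSeek-R1-style), not the paper's actual schema.

```python
import json

def make_training_record(document, synthetic_query, reasoning_trace, answer):
    # Package one (document, query) pair with the frontier model's reasoning
    # trace so the small model is fine-tuned to reason before answering.
    return {
        "messages": [
            {"role": "system", "content": "Answer using the provided context."},
            {"role": "user",
             "content": f"Context: {document}\nQuestion: {synthetic_query}"},
            # Reasoning wrapped in <think> tags before the final answer; the
            # exact trace format used by the paper is not specified here.
            {"role": "assistant",
             "content": f"<think>{reasoning_trace}</think>{answer}"},
        ]
    }

record = make_training_record(
    "Asthma is a condition that affects the airways.",
    "Which part of the body does asthma affect?",
    "The context states asthma affects the airways, so that is the answer.",
    "Asthma affects the airways.",
)
print(json.dumps(record, indent=2))
```

Records like this, generated at scale from synthetic queries over the curated corpus, would form the reasoning-aware fine-tuning set for the lightweight backbone.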