🤖 AI Summary
This paper proposes a retrieval-augmented reasoning (RAR) framework for small language models, aimed at resource-constrained and high-security settings that need a lightweight model, data privacy, and well-explained answers to complex domain queries at the same time. The method combines dense retrieval with reasoning-aware fine-tuning: reasoning traces and synthetic queries generated by a large model supply the training data, and summarization-based document compression improves training efficiency. Built on a locally deployed, fine-tuned Qwen2.5-Instruct model and a local dense retriever, the framework runs RAR inference entirely on-device. Experiments on the NHS medical knowledge base show substantial gains in answer accuracy and consistency, approaching the performance of large models while outperforming existing lightweight baselines. The code and models are open-sourced, supporting cross-domain adaptation and reproducibility.
📝 Abstract
This technical report details a novel approach to combining reasoning and retrieval-augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval-augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus (here, the NHS A-to-Z condition pages). We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.
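As a rough illustration of the retrieve-then-reason flow described above, the following minimal sketch ranks documents by cosine similarity and assembles a prompt for a locally hosted model. It is not the authors' implementation: the hashed bag-of-words `embed` function is a toy stand-in for a real dense encoder, the names `embed`, `retrieve`, and `build_prompt` are hypothetical, the prompt wording is invented, and the two documents are placeholder text rather than NHS content.

```python
from __future__ import annotations

import hashlib
import math
from collections import Counter

DIM = 256  # toy embedding size; an assumption, not a value from the report


def embed(text: str) -> list[float]:
    """Toy stand-in for a dense encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token, count in Counter(text.lower().split()).items():
        # Map each token to a fixed slot so the embedding is deterministic.
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[str]:
    """Return the titles of the k documents most similar to the query."""
    q = embed(query)
    # Vectors are unit length, so the dot product is the cosine similarity.
    scores = {
        title: sum(a * b for a, b in zip(q, embed(body)))
        for title, body in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def build_prompt(query: str, docs: dict[str, str], k: int = 2) -> str:
    """Assemble retrieved context and the question into one prompt string."""
    titles = retrieve(query, docs, k)
    context = "\n\n".join(f"[{t}]\n{docs[t]}" for t in titles)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Reason step by step over the context, then give a final answer."
    )


# Placeholder documents (hypothetical text, not taken from the NHS pages).
DOCS = {
    "cough": "a cough that lasts with fever and chest pain",
    "sprain": "ankle swelling after a twist or fall",
}

if __name__ == "__main__":
    print(build_prompt("persistent cough and fever", DOCS, k=1))
```

In a real deployment the resulting prompt would be passed to the fine-tuned local model, and `embed` would be replaced by the local dense retriever's encoder.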