AI Summary
This work addresses the limited generalization of small language models in privacy-sensitive and resource-constrained settings, where they struggle to adapt to unfamiliar codebases due to weak out-of-distribution reasoning capabilities. To overcome this, we propose Repository-Centric Learning (RCL), a novel paradigm that shifts from traditional task-centric training to deeply internalizing the "physical laws" of a single code repository, thereby constructing lightweight, repository-specialized expert models. We introduce a four-component RCL training framework that transforms static code repositories into interactive learning signals, yielding the SWE-Spot-4B model series. These models outperform larger open-source counterparts such as Qwen3-Coder-30B across multiple software engineering benchmarks, match the performance of efficient commercial models like GPT-4.1-mini, and achieve significantly higher sample efficiency and lower inference costs.
Abstract
The deployment of coding agents in privacy-sensitive and resource-constrained environments drives the demand for capable open-weight Small Language Models (SLMs). However, SLMs suffer from a fundamental capability gap: unlike frontier large models, they lack the strong inference-time generalization needed to work with complicated, unfamiliar codebases. We identify that the prevailing Task-Centric Learning (TCL) paradigm, which scales exposure across disparate repositories, fails to address this limitation. In response, we propose Repository-Centric Learning (RCL), a paradigm shift that prioritizes vertical repository depth over horizontal task breadth: SLMs must internalize the "physics" of a target software environment through parametric knowledge acquisition, rather than attempting to recover it via costly inference-time search. Following this new paradigm, we design a four-unit Repository-Centric Experience that transforms static codebases into interactive learning signals, and use it to train SWE-Spot-4B, a family of highly compact models built as repo-specialized experts. SWE-Spot-4B breaks established scaling trends, outperforming larger open-weight models (e.g., CWM by Meta, Qwen3-Coder-30B) and matching or surpassing efficiency-focused commercial models (e.g., GPT-4.1-mini, GPT-5-nano) across multiple SWE tasks. Further analysis reveals that RCL yields higher training sample efficiency and lower inference costs, emphasizing that for building efficient intelligence, repository mastery is a distinct and necessary dimension that complements general coding capability.