SWE-Spot: Building Small Repo-Experts with Repository-Centric Learning

πŸ“… 2026-01-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limited generalization of small language models in privacy-sensitive and resource-constrained settings, where they struggle to adapt to unfamiliar codebases due to weak out-of-distribution reasoning capabilities. To overcome this, we propose Repository-Centric Learning (RCL), a novel paradigm that shifts from traditional task-centric training to deeply internalizing the β€œphysical laws” of a single code repository, thereby constructing lightweight, repository-specialized expert models. We introduce a four-component RCL training framework that transforms static code repositories into interactive learning signals, yielding the SWE-Spot-4B model series. These models outperform larger open-source counterparts such as Qwen3-Coder-30B across multiple software engineering benchmarks, match the performance of efficient commercial models like GPT-4.1-mini, and achieve significantly higher sample efficiency and lower inference costs.

πŸ“ Abstract
The deployment of coding agents in privacy-sensitive and resource-constrained environments drives the demand for capable open-weight Small Language Models (SLMs). However, these models suffer from a fundamental capability gap: unlike frontier large models, they lack the strong inference-time generalization needed to work with complicated, unfamiliar codebases. We identify that the prevailing Task-Centric Learning (TCL) paradigm, which scales exposure across disparate repositories, fails to address this limitation. In response, we propose Repository-Centric Learning (RCL), a paradigm shift that prioritizes vertical repository depth over horizontal task breadth, arguing that SLMs must internalize the "physics" of a target software environment through parametric knowledge acquisition rather than attempting to recover it via costly inference-time search. Following this new paradigm, we design a four-unit Repository-Centric Experience that transforms static codebases into interactive learning signals, and use it to train SWE-Spot-4B, a family of highly compact models built as repo-specialized experts. SWE-Spot-4B breaks established scaling trends, outperforming larger open-weight models (e.g., CWM by Meta, Qwen3-Coder-30B) and matching or surpassing efficiency-focused commercial models (e.g., GPT-4.1-mini, GPT-5-nano) across multiple SWE tasks. Further analysis reveals that RCL yields higher training sample efficiency and lower inference costs, emphasizing that for building efficient intelligence, repository mastery is a distinct and necessary dimension that complements general coding capability.
Problem

Research questions and friction points this paper is trying to address.

Small Language Models
Codebase Generalization
Repository Understanding
Software Engineering Agents
Inference-time Generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repository-Centric Learning
Small Language Models
Code Specialization
Parametric Knowledge Acquisition
Software Engineering Agents
πŸ‘₯ Authors
Jinjun Peng
Department of Computer Science, Columbia University, New York, USA
Magnus Saebo
Department of Computer Science, Columbia University, New York, USA
Tianjun Zhong
Department of Computer Science, Columbia University, New York, USA
Yi-Jie Cheng
Department of Computer Science, Columbia University, New York, USA
Junfeng Yang
Professor of Computer Science, Columbia University
Operating systems, software reliability, concurrency, and security
Baishakhi Ray
Associate Professor, Columbia University
Software Engineering, Machine Learning, AI4Code, AI4SE, SE4AI
Simin Chen
Columbia University
Software Engineering, Machine Learning
Yangruibo Ding
Computer Science Department, University of California, Los Angeles, California, USA