Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether a small language model can replace large models for terminal execution within agent systems, relieving the main agent's context burden. It specializes Qwen3-4B into Terminus-4B through supervised fine-tuning (SFT) and reinforcement learning (RL) with a rubric-based LLM-as-judge reward, tailoring the model to terminal-execution subtasks. On SWE-Bench Pro and an internal SWE-Bench C# benchmark, using Terminus-4B as the subagent reduces the main agent's token consumption by up to roughly 30% relative to a no-subagent baseline without hurting benchmark performance, and it reduces the main agent's direct terminal invocations. As a subagent, Terminus-4B closes the gap between the vanilla Qwen model and frontier models and often exceeds them. The results indicate that a purpose-trained small model can substitute for frontier models in terminal execution, balancing performance and computational efficiency.

📝 Abstract

Modern coding agents increasingly delegate specialized subtasks to subagents: smaller, focused agentic loops that handle narrow responsibilities such as search, debugging, or terminal execution. This architectural pattern keeps the main agent's context window clean by isolating verbose outputs (e.g., build logs and test results) within the subagent's context. Agents that employ subagents for such tasks typically run them on frontier models. In this paper, we investigate whether a fine-tuned small language model (SLM) can match frontier models at agentic terminal execution. We present Terminus-4B, a Qwen3-4B model post-trained specifically for this task with Supervised Finetuning (SFT) and Reinforcement Learning (RL) using a rubric-based LLM-as-judge reward. In an extensive evaluation spanning frontier models, training ablations, and main agent configurations, we find that Terminus-4B reduces the main agent's token usage by up to ~30% compared to the No Subagent baseline, with no impact on agent performance on benchmarks like SWE-Bench Pro and our internal SWE-Bench C# benchmark, which tends to be heavy in verbose execution tasks. Terminus-4B also improves key metrics indicating that the main agent relies on the subagent's outputs and performs fewer terminal execution tasks itself. Our model not only closes the gap between the vanilla Qwen model and frontier models such as Claude Sonnet, Opus, and GPT-5.3-Codex, but often exceeds their performance.
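
The delegation pattern described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `Context`, `run_command`, and `terminal_subagent` are hypothetical, a whitespace word count stands in for a real tokenizer, and a simple "keep the last few lines" rule stands in for the summary that a model such as Terminus-4B would write.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Context:
    """Toy context window: a list of text messages (hypothetical type)."""
    messages: list = field(default_factory=list)

    def add(self, text: str) -> None:
        self.messages.append(text)

    def size(self) -> int:
        # Assumption: whitespace-separated words approximate token count.
        return sum(len(m.split()) for m in self.messages)


def run_command(argv):
    """Run a command and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def terminal_subagent(argv, tail_lines: int = 3):
    """Hypothetical subagent: the full log stays in its own context and
    only a short report is handed back to the caller."""
    sub_ctx = Context()
    code, output = run_command(argv)
    sub_ctx.add(output)  # verbose output is isolated here
    # Assumption: a real subagent would summarize with a model;
    # this sketch just keeps the last few lines of the log.
    tail = output.strip().splitlines()[-tail_lines:]
    report = f"exit={code}; last lines: " + " | ".join(tail)
    return report, sub_ctx


if __name__ == "__main__":
    main_ctx = Context()
    # A stand-in for a noisy build: 200 lines of log output.
    cmd = [sys.executable, "-c",
           "for i in range(200): print('building module', i)"]
    report, sub_ctx = terminal_subagent(cmd)
    main_ctx.add(report)  # the main agent sees only the short report
    print("main context size:", main_ctx.size())
    print("subagent context size:", sub_ctx.size())
```

In this toy run the main agent's context grows only by the short report, while the full log is confined to the subagent's context. The token savings reported in the abstract come from the authors' own evaluation, not from a sketch like this one.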

Problem

Research questions and friction points this paper is trying to address.

agentic execution

small language model

subagent

terminal execution

frontier LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

small language model

agentic execution

subagent