COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-horizon tasks, LLM-based agents suffer from degraded coherence and accuracy due to error accumulation, hallucination, and context overload—stemming primarily from inadequate dynamic context management and insufficient coordination across multi-step reasoning. To address this, we propose a hierarchical three-module collaborative architecture: a primary agent for tactical execution, a meta-thinker for strategic oversight and reflective intervention, and a context manager that maintains high-information-density state via dynamic summarization and lightweight scheduling. This design enables test-time scaling and facilitates efficient post-training optimization for smaller models. Evaluated on GAIA, BrowseComp, and Humanity’s Last Exam, our approach achieves up to a 20% absolute accuracy gain, matches the performance of DeepResearch, and significantly improves both reasoning efficiency and long-range coherence.

📝 Abstract
Long-horizon tasks that require sustained reasoning and multiple tool interactions remain challenging for LLM agents: small errors compound across steps, and even state-of-the-art models often hallucinate or lose coherence. We identify context management as the central bottleneck -- extended histories cause agents to overlook critical evidence or become distracted by irrelevant information, thus failing to replan or reflect on previous mistakes. To address this, we propose COMPASS (Context-Organized Multi-Agent Planning and Strategy System), a lightweight hierarchical framework that separates tactical execution, strategic oversight, and context organization into three specialized components: (1) a Main Agent that performs reasoning and tool use, (2) a Meta-Thinker that monitors progress and issues strategic interventions, and (3) a Context Manager that maintains concise, relevant progress briefs for different reasoning stages. Across three challenging benchmarks -- GAIA, BrowseComp, and Humanity's Last Exam -- COMPASS improves accuracy by up to 20% relative to both single- and multi-agent baselines. We further introduce a test-time scaling extension that elevates performance to match established DeepResearch agents, and a post-training pipeline that delegates context management to smaller models for enhanced efficiency.
Problem

Research questions and friction points this paper is trying to address.

Addressing context management in long-horizon reasoning tasks
Reducing error accumulation and hallucinations in LLM agents
Enhancing coherence through specialized hierarchical agent components
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical framework with three specialized components
Meta-Thinker monitors progress and issues interventions
Context Manager maintains concise relevant progress briefs
👥 Authors
Guangya Wan — University of Virginia (Deep Learning, Large Language Model)
Mingyang Ling — Google Cloud AI
Xiaoqi Ren — Google (LLM)
Rujun Han — Google (NLP, Machine Learning)
Sheng Li — University of Virginia
Zizhao Zhang — Google Cloud AI