Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high inference latency and low resource utilization of large language models (LLMs) in edge computing scenarios, this paper proposes a token-level concurrent inference paradigm: multiple lightweight “thinkers” execute in parallel within a single LLM, each with visibility into the others’ partial token sequences, dynamically monitoring one another’s generation progress to coordinate in real time—including dynamic termination, handoff, and mid-sentence redirection. The method requires no retraining or architectural modification; a lightweight adaptation of existing LLMs suffices for efficient deployment on otherwise underutilized local GPUs. Evaluation on open-source LLMs demonstrates up to a 42% end-to-end latency reduction, alongside improved accuracy and consistency on complex reasoning tasks. The approach thus achieves a favorable trade-off among low-latency inference, high output quality, and edge-device compatibility.
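The coordination described above can be made concrete with a toy sketch (my simplification, not the paper's implementation): N "thinkers" take turns emitting one token each, and every thinker conditions on the interleaved transcript of all threads so far, so a thread can stop as soon as another thread has covered its ground. The names `group_think_decode` and `demo_step` are hypothetical illustrations.

```python
from typing import Callable, List, Tuple

def group_think_decode(
    step_fn: Callable[[int, List[Tuple[int, str]]], str],
    num_thinkers: int,
    max_tokens_per_thinker: int,
) -> List[List[str]]:
    """Round-robin token-level decoding with shared visibility.

    step_fn(thinker_id, transcript) returns the next token for that
    thinker given the full interleaved transcript, or "" to stop.
    """
    transcript: List[Tuple[int, str]] = []   # (thinker_id, token) pairs
    active = set(range(num_thinkers))
    threads: List[List[str]] = [[] for _ in range(num_thinkers)]

    for _ in range(max_tokens_per_thinker):
        for tid in list(active):
            token = step_fn(tid, transcript)
            if not token:                     # thinker terminates itself
                active.discard(tid)
                continue
            transcript.append((tid, token))   # visible to all thinkers
            threads[tid].append(token)
        if not active:
            break
    return threads

# Demo: thinker 1 defers as soon as it sees thinker 0 has claimed
# "case-A", illustrating dynamic termination driven by another
# thread's partial progress.
def demo_step(tid: int, transcript: List[Tuple[int, str]]) -> str:
    thinker0_tokens = [tok for t, tok in transcript if t == 0]
    if tid == 0:
        plan = ["case-A", "case-B", "done"]
        n = len(thinker0_tokens)
        return plan[n] if n < len(plan) else ""
    return "" if "case-A" in thinker0_tokens else "case-A"
```

Running `group_think_decode(demo_step, 2, 5)` yields `[["case-A", "case-B", "done"], []]`: thinker 1 never emits a token because thinker 0 had already covered "case-A" when thinker 1 first looked at the shared transcript.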

📝 Abstract
Recent advances in large language models (LLMs) have demonstrated the power of reasoning through self-generated chains of thought. Multiple reasoning agents can collaborate to raise joint reasoning quality above individual outcomes. However, such agents typically interact in a turn-based manner, trading increased latency for improved quality. In this paper, we propose Group Think: a single LLM that acts as multiple concurrent reasoning agents, or thinkers. With shared visibility into each other's partial generation progress, Group Think introduces a new concurrent-reasoning paradigm in which multiple reasoning trajectories adapt dynamically to one another at the token level. For example, a reasoning thread may shift its generation mid-sentence upon detecting that another thread is better positioned to continue. This fine-grained, token-level collaboration enables Group Think to reduce redundant reasoning and improve quality while achieving significantly lower latency. Moreover, its concurrent nature allows for efficient utilization of idle computational resources, making it especially suitable for edge inference, where very small batch size often underutilizes local GPUs. We give a simple and generalizable modification that enables any existing LLM to perform Group Think on a local GPU. We also present an evaluation strategy to benchmark reasoning latency and empirically demonstrate latency improvements using open-source LLMs that were not explicitly trained for Group Think. We hope this work paves the way for future LLMs to exhibit more sophisticated and more efficient collaborative behavior for higher quality generation.
Problem

Research questions and friction points this paper is trying to address.

Enabling multiple reasoning agents to collaborate concurrently at the token level
Reducing redundant reasoning while improving quality and lowering latency
Efficiently utilizing idle computational resources for edge inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple concurrent reasoning agents collaborating at the token level within a single LLM
Shared visibility into partial generations, enabling dynamically adaptive reasoning trajectories
Efficient use of idle GPU capacity, well suited to small-batch edge inference
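The edge-inference argument can be illustrated with a toy latency model (my illustration, not the paper's analysis): at batch size 1, each decode step is typically memory-bandwidth bound, so stepping n thinkers as a batch of n costs only slightly more wall-clock time than stepping one sequence. If the thinkers split the reasoning work, end-to-end latency shrinks roughly with n. The function name and the overhead parameter are hypothetical.

```python
def group_think_speedup(total_tokens: int, n_thinkers: int,
                        batched_step_overhead: float = 1.1) -> float:
    """Idealized sequential-vs-concurrent latency ratio.

    Assumes the n thinkers split the reasoning tokens evenly and that
    one batched decode step costs `batched_step_overhead` times a
    batch-1 step (close to 1 in the memory-bound regime).
    """
    sequential_steps = total_tokens                         # one token per step
    concurrent_steps = (total_tokens / n_thinkers) * batched_step_overhead
    return sequential_steps / concurrent_steps
```

Under these assumptions, four thinkers with zero batching overhead give an ideal 4x latency reduction; any real gain depends on how evenly the reasoning actually divides across threads.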