🤖 AI Summary
This work addresses the quality–latency trade-off confronting LLM agents in ultra-low-latency applications, such as high-frequency trading (HFT) and real-time competitive gaming, where static inference configurations yield suboptimal performance. We first systematically characterize task-dependent latency–quality Pareto fronts. Building on this analysis, we propose FPX, an adaptive inference framework that jointly optimizes model size and quantization granularity via quantization-aware inference, real-time environment modeling, and latency-sensitive reinforcement evaluation. We introduce two new benchmarks: HFTBench for financial decision-making under microsecond-scale latency constraints, and StreetFighter for reactive, adversarial game play. Experiments show that FPX improves win rate by up to 80% in StreetFighter and daily profit-and-loss (P&L) by up to 26.52% in HFT, significantly outperforming static deployment baselines. Our work establishes a scalable, empirically grounded methodology for deploying latency-critical LLM agents.
📝 Abstract
Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite its importance, this latency–quality trade-off remains underexplored for LLM-based agents. In this work, we present the first systematic study of the trade-off in real-time decision-making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high-frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that the optimal latency–quality balance varies by task, and that sacrificing quality for lower latency can significantly improve downstream performance. To exploit this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. FPX achieves the best performance on both benchmarks, improving win rate by up to 80% in StreetFighter and boosting daily yield by up to 26.52% in trading. These results underscore the critical importance of latency-aware evaluation and deployment strategies for real-world LLM-based agents. Our benchmarks are available at Latency Sensitive Benchmarks.
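To make the core idea concrete, here is a minimal conceptual sketch of adaptive configuration selection: given a latency budget, pick the highest-quality (model size, quantization) pair on the latency–quality Pareto front that still meets the budget. This is not the authors' FPX implementation; the configuration names, latencies, and quality scores below are illustrative placeholders.

```python
# Conceptual sketch of latency-budgeted configuration selection.
# All numbers are illustrative, not measurements from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    model: str         # model size, e.g. "7B" or "1B"
    quant: str         # quantization level, e.g. "fp16", "int4"
    latency_ms: float  # measured per-decision latency
    quality: float     # offline task-quality score in [0, 1]

CONFIGS = [
    Config("7B", "fp16", latency_ms=120.0, quality=0.95),
    Config("7B", "int4", latency_ms=60.0, quality=0.90),
    Config("1B", "fp16", latency_ms=25.0, quality=0.75),
    Config("1B", "int4", latency_ms=12.0, quality=0.65),
]

def select_config(budget_ms: float) -> Config:
    """Return the highest-quality config that meets the latency budget,
    falling back to the fastest config when none fits."""
    feasible = [c for c in CONFIGS if c.latency_ms <= budget_ms]
    if not feasible:
        return min(CONFIGS, key=lambda c: c.latency_ms)
    return max(feasible, key=lambda c: c.quality)

# A fast-moving tick leaves only ~30 ms: a small model wins.
print(select_config(30.0).model)   # "1B"
# A calmer window allows ~150 ms: full-precision large model.
print(select_config(150.0).quant)  # "fp16"
```

In a real deployment the latency budget would be derived from a live signal (e.g. market volatility or opponent reaction time), and latencies would be profiled per hardware target rather than hard-coded.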