🤖 AI Summary
This work investigates the token overhead that JSON formatting introduces in agent tool invocation, which reduces efficiency. Lightweight alternatives such as TOON and TRON had previously been evaluated only on isolated comprehension or generation tasks; this study evaluates them inside multi-turn, end-to-end agent benchmarks, decoupling input comprehension from output generation. It compares token consumption and task accuracy across four benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, and StableToolBench) and five open-weight large language models. TRON reduces token usage by up to 27% while keeping accuracy within 14 percentage points of the JSON baseline. TOON saves up to 18% of tokens at a 9-percentage-point accuracy cost, but it also suffers cascading multi-turn parsing failures and collapses parallel tool-call output for most models. The work thus shows both the practical benefits and the limitations of token-optimized formats in end-to-end agent scenarios.
📝 Abstract
Large language models in agentic AI systems consume tool schemas and execution results, and emit tool invocations, as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a 9pp accuracy cost, but it additionally suffers cascading multi-turn parsing failures and collapses parallel tool-call output for most models.
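To make the source of the overhead concrete, the following minimal sketch compares a JSON serialization of a small tool result with a header-plus-rows encoding. The encoding, the helper names `to_tabular` and `reduction_percent`, and the sample data are all assumptions made for illustration; the encoding is not the TOON or TRON grammar, and character counts are only a rough proxy for the token counts the paper measures.

```python
import json


def to_tabular(rows):
    """Encode dicts that share the same keys as one header line plus one line per row.

    Hypothetical encoding for illustration only; this is NOT TOON or TRON.
    It assumes values contain no commas or newlines.
    """
    keys = list(rows[0].keys())
    lines = [",".join(keys)]
    for row in rows:
        lines.append(",".join(str(row[k]) for k in keys))
    return "\n".join(lines)


def reduction_percent(baseline, candidate):
    """Relative size reduction of candidate versus baseline, in percent."""
    return 100.0 * (len(baseline) - len(candidate)) / len(baseline)


# Made-up tool execution result, not taken from any benchmark.
results = [
    {"id": 1, "city": "Oslo", "temp_c": 4},
    {"id": 2, "city": "Lima", "temp_c": 19},
]

as_json = json.dumps(results)   # repeats every key, quote, and brace per record
as_table = to_tabular(results)  # states the keys once, then values only
saving = reduction_percent(as_json, as_table)

print(as_json)
print(as_table)
print(f"character reduction: {saving:.1f}%")
```

The saving in this toy case comes from stating the keys once and dropping quotes and braces. It says nothing about whether a model can still read or emit the compact form, which is the question the end-to-end evaluation addresses.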