🤖 AI Summary
Large language models (LLMs) deployed as autonomous agents frequently rely on Python-based tool-calling code, yet suffer from performance bottlenecks, security vulnerabilities, and insufficient reliability—hindering real-world deployment. To address these challenges, we propose Quasar, a lightweight domain-specific language explicitly designed for LLM agent code actions. Quasar introduces three novel, tightly integrated capabilities: automatic parallelization, uncertainty quantification (to mitigate hallucination), and user-verifiable safety policies. It supports transparent transpilation from a safe Python subset and ensures trustworthiness via static analysis, sandboxed execution, and conformal prediction. Evaluated on the GQA benchmark using the ViperGPT agent, Quasar reduces execution time by 42%, decreases user approval interactions by 52%, and achieves the target reliability coverage—demonstrating significant gains in efficiency, safety, and dependability for LLM-driven automation.
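One of the summarized capabilities, automatic parallelization, can be illustrated with a minimal sketch. The tool functions and their names below are hypothetical stand-ins, not Quasar's actual API; the point is only that independent tool calls in straight-line agent code can be executed concurrently by the runtime.

```python
import asyncio

async def detect_objects(image):
    # Stand-in for a slow vision tool call (name is illustrative only).
    await asyncio.sleep(0.1)
    return ["cat", "ball"]

async def caption(image):
    # Stand-in for a second, independent tool call.
    await asyncio.sleep(0.1)
    return "a cat playing with a ball"

async def main():
    # Plain sequential Python would await each call in turn; a runtime that
    # detects the two calls are independent can run them concurrently.
    objects, text = await asyncio.gather(detect_objects("img"), caption("img"))
    return objects, text

objects, text = asyncio.run(main())
```

Here the concurrent version takes roughly the duration of one call rather than the sum of both, which is the kind of speedup the reported 42% execution-time reduction refers to.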
📝 Abstract
Modern large language models (LLMs) are often deployed as agents, calling external tools adaptively to solve tasks. Rather than directly calling tools, it can be more effective for LLMs to write code to perform the tool calls, enabling them to automatically generate complex control flow such as conditionals and loops. Such code actions are typically provided as Python code, since LLMs are quite proficient at it; however, Python may not be the ideal language due to its limited built-in support for performance, security, and reliability. We propose a novel programming language for code actions, called Quasar, which has several benefits: (1) automated parallelization to improve performance, (2) uncertainty quantification to improve reliability and mitigate hallucinations, and (3) security features enabling the user to validate actions. LLMs write code in a subset of Python, which is automatically transpiled to Quasar. We evaluate our approach on the ViperGPT visual question answering agent, applied to the GQA dataset, demonstrating that LLMs with Quasar actions instead of Python actions retain strong performance, while reducing execution time by 42% when parallelization is possible, improving security by reducing user approval interactions by 52% when possible, and improving reliability by applying conformal prediction to achieve a desired target coverage level.
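The reliability claim rests on conformal prediction, whose coverage guarantee can be sketched concisely. This is a generic split conformal prediction example on synthetic nonconformity scores (e.g., one minus a model's confidence), not the paper's implementation: calibrate a threshold on held-out scores so that, with target miscoverage alpha, fresh examples fall under the threshold at least 1 - alpha of the time.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal quantile: P(new score <= qhat) >= 1 - alpha."""
    n = len(cal_scores)
    # Finite-sample-corrected quantile level, capped at 1.0.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level, method="higher")

rng = np.random.default_rng(0)
# Synthetic nonconformity scores standing in for calibration and test data.
cal_scores = rng.uniform(size=1000)
test_scores = rng.uniform(size=1000)

qhat = conformal_threshold(cal_scores, alpha=0.1)
coverage = float(np.mean(test_scores <= qhat))  # empirically close to 0.9
```

In an agent setting, answers (or actions) whose nonconformity score exceeds `qhat` would be flagged as uncertain, which is how a desired target coverage level can be met by construction rather than by tuning.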