🤖 AI Summary
Large language model (LLM) agents in interactive decision-making frequently suffer from “agent-environment misalignment”. Because the interfaces between agent and environment are rigid and predefined, the agent's internal expectations about the effects of its actions diverge from the actual state transitions, creating a critical performance bottleneck. This paper proposes ALIGN, a lightweight, non-intrusive framework for automatic interface generation that requires no modification to agent or environment code. ALIGN combines LLM-driven interface meta-generation, environment-aware enhanced modeling, and dynamic observation structuring to enrich both the static environment information and the step-wise observations returned to the agent. The work systematically characterizes the impact of interface-layer mismatch and establishes a fully automated, fine-tuning-free interface generation paradigm that generalizes across diverse LLM backbones and agent architectures. Evaluated on benchmarks spanning embodied tasks, web navigation, and tool use, ALIGN achieves up to a 45.67% improvement in task success rate, observed on ALFWorld.
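To make the notion of misalignment concrete, the following Python sketch compares the state change an agent expects from an action with the transition the environment actually produces. Everything in it is hypothetical and invented for illustration: the `State` class, the environment rule that `take` requires an explicit source receptacle (loosely in the style of text-based embodied tasks), and the naive agent belief in `agent_expectation`. It is not code from ALIGN.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical toy world state: held objects and object locations."""
    inventory: frozenset = frozenset()
    locations: dict = field(default_factory=dict)  # object -> receptacle


def env_step(state: State, action: str) -> tuple[State, str]:
    """Hypothetical environment rule: 'take' only works with an explicit
    source, e.g. 'take apple from table'; anything else is a no-op."""
    words = action.split()
    if len(words) == 4 and words[0] == "take" and words[2] == "from":
        obj, src = words[1], words[3]
        if state.locations.get(obj) == src:
            remaining = {k: v for k, v in state.locations.items() if k != obj}
            return State(state.inventory | {obj}, remaining), f"You pick up the {obj}."
    return state, "Nothing happens."


def agent_expectation(state: State, action: str) -> State:
    """Hypothetical agent belief: 'take X ...' puts X in the inventory,
    whether or not a source receptacle is named."""
    words = action.split()
    if len(words) >= 2 and words[0] == "take" and words[1] in state.locations:
        obj = words[1]
        remaining = {k: v for k, v in state.locations.items() if k != obj}
        return State(state.inventory | {obj}, remaining)
    return state


def is_misaligned(state: State, action: str) -> bool:
    """Misalignment: the expected next state differs from the actual one."""
    expected = agent_expectation(state, action)
    actual, _ = env_step(state, action)
    return expected != actual


start = State(frozenset(), {"apple": "table"})
print(is_misaligned(start, "take apple"))             # True
print(is_misaligned(start, "take apple from table"))  # False
```

In the first call the agent expects to hold the apple, but the environment leaves the state unchanged and only reports "Nothing happens.", which is exactly the kind of gap between expected and actual transitions described above.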
📝 Abstract
Large language model (LLM) agents have shown impressive reasoning capabilities in interactive decision-making tasks. These agents interact with the environment through intermediate interfaces, such as predefined action spaces and interaction rules, which mediate perception and action. However, mismatches often arise between the agent's internal expectations about the effects of its actions and the actual state transitions in the environment, a phenomenon referred to as agent-environment misalignment. While prior work has invested substantially in improving agent strategies and environment design, the critical role of the interface remains underexplored. In this work, we empirically demonstrate that agent-environment misalignment poses a significant bottleneck to agent performance. To mitigate this issue, we propose ALIGN (Auto-Aligned Interface Generation), a framework that alleviates the misalignment by enriching the interface. Specifically, the ALIGN-generated interface enhances both the static information of the environment and the step-wise observations returned to the agent. Implemented as a lightweight wrapper, this interface achieves alignment without modifying either the agent logic or the environment code. Experiments across multiple domains, including embodied tasks, web navigation, and tool use, show consistent performance improvements, with up to a 45.67% success rate improvement observed in ALFWorld. Moreover, the ALIGN-generated interface can generalize across different agent architectures and LLM backbones without interface regeneration. Code and experimental results are available at https://github.com/THUNLP-MT/ALIGN.
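The wrapper idea can be illustrated with a minimal Python sketch, assuming a text environment that exposes `reset()` and `step(action)` methods returning observation strings. `ToyEnv`, `InterfaceWrapper`, and `diagnose_take` are hypothetical names. In ALIGN the interface content is generated automatically, whereas here the static note and the observation hint are hand-written stand-ins. The sketch is only structural: the wrapper sits between an unchanged agent and an unchanged environment and adds information to the initial description and to the step-wise observations.

```python
from typing import Callable, Optional


class ToyEnv:
    """Hypothetical environment; treated as code we cannot modify."""

    def __init__(self) -> None:
        self.locations = {"apple": "table"}
        self.inventory: set = set()

    def reset(self) -> str:
        return "You are in a kitchen. You see a table."

    def step(self, action: str) -> str:
        words = action.split()
        if len(words) == 4 and words[0] == "take" and words[2] == "from":
            obj, src = words[1], words[3]
            if self.locations.get(obj) == src:
                del self.locations[obj]
                self.inventory.add(obj)
                return f"You pick up the {obj}."
        return "Nothing happens."


class InterfaceWrapper:
    """Hypothetical wrapper that enriches static info and step-wise
    observations; neither the environment nor the agent is edited."""

    def __init__(self, env: ToyEnv, static_notes: str,
                 diagnose: Callable[[str, str], Optional[str]]) -> None:
        self.env = env
        self.static_notes = static_notes  # extra static environment information
        self.diagnose = diagnose          # maps (action, raw obs) to an optional hint

    def reset(self) -> str:
        # Append the static notes to the environment's initial description.
        return f"{self.env.reset()}\n{self.static_notes}"

    def step(self, action: str) -> str:
        obs = self.env.step(action)
        hint = self.diagnose(action, obs)
        # Enrich the raw observation only when the diagnoser has something to add.
        return obs if hint is None else f"{obs} {hint}"


def diagnose_take(action: str, obs: str) -> Optional[str]:
    """Hand-written stand-in for an automatically generated explanation."""
    words = action.split()
    if obs == "Nothing happens." and words[:1] == ["take"] and "from" not in words:
        return "Hint: 'take' needs a source, e.g. 'take <object> from <receptacle>'."
    return None


env = InterfaceWrapper(
    ToyEnv(),
    "Note: objects must be taken from a named receptacle.",
    diagnose_take,
)
print(env.reset())
print(env.step("take apple"))             # raw "Nothing happens." plus a hint
print(env.step("take apple from table"))  # You pick up the apple.
```

Because the wrapper exposes the same `reset`/`step` signature as the environment, an existing agent loop could call it in place of the raw environment without any change to the agent's code.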