🤖 AI Summary
This work addresses the inefficiency of traditional 3D scene editing and the difficulty existing vision-language agents face in balancing editing precision with responsiveness. We propose EZBlender, a hybrid agent framework that integrates deliberative task decomposition with reactive local autonomy through a novel Plan-and-ReAct architecture. This design preserves semantic consistency and high editing quality while substantially reducing latency and computational overhead. To rigorously evaluate our approach, we introduce the first systematic benchmark for multi-task 3D editing, on which EZBlender demonstrates significant advantages over prior methods in language model preference, response speed, and computational cost.
📝 Abstract
As a cornerstone of the modern digital economy, 3D modeling and rendering demand substantial resources and manual effort when scenes are edited in the traditional manner. Despite recent progress in VLM-based agents for 3D editing, the fundamental trade-off between editing precision and agent responsiveness remains unresolved. To overcome these limitations, we present EZBlender, a Blender agent built on a hybrid framework that combines planning-based task decomposition with reactive local autonomy, enabling efficient human-AI collaboration and semantically faithful 3D editing. This previously unexplored Plan-and-ReAct design not only preserves editing quality but also significantly reduces latency and computational cost. To further validate the efficiency and effectiveness of the proposed edge-autonomy architecture, we construct a dedicated multi-task benchmark, a setting that has not been systematically investigated in prior research. In addition, we provide a comprehensive analysis of language model preference, system responsiveness, and economic efficiency.