🤖 AI Summary
This work investigates whether vision-language models (VLMs) can enable robots to autonomously design and use novel tools for everyday manipulation tasks. Method: We propose a VLM-driven co-evolutionary framework for tool morphology and control strategy, leveraging VLMs' cross-modal understanding and code-generation capabilities to jointly synthesize tool geometry and executable control programs, while employing evolutionary search to optimize both components concurrently in physics simulation. Contribution/Results: To our knowledge, this is the first end-to-end, closed-loop application of VLMs to automated tool invention, spanning the full pipeline from natural-language task description to physically realizable, executable tools. Evaluated on a newly constructed benchmark of daily manipulation tasks, our approach significantly outperforms both human-designed tools and human-instruction-driven baselines. It transforms complex tasks into robust, transferable, low-level execution sequences, demonstrating the potential of large foundation models to unify perception, creative reasoning, and physical execution in embodied intelligence.
📝 Abstract
Tool design and use reflect the ability to understand and manipulate the physical world through creativity, planning, and foresight. As such, these capabilities are often regarded as measurable indicators of intelligence across biological species. While much of today's research on robotic intelligence focuses on generating better controllers, inventing smarter tools offers a complementary form of physical intelligence: shifting the onus of problem-solving onto the tool's design. Given the vast and impressive common-sense, reasoning, and creative capabilities of today's foundation models, we investigate whether these models can provide useful priors for automatically designing and effectively wielding such tools. We present VLMgineer, a framework that harnesses the code generation abilities of vision language models (VLMs) together with evolutionary search to iteratively co-design physical tools and the action plans that operate them to perform a task. We evaluate VLMgineer on a diverse new benchmark of everyday manipulation scenarios that demand creative tool design and use. Across this suite, VLMgineer consistently discovers tools and policies that solve tasks more effectively and innovatively, transforming challenging robotics problems into straightforward executions. It also outperforms VLM-generated designs from human specifications and existing human-crafted tools for everyday tasks. To facilitate future research on automated tool invention, we will release our benchmark and code.
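The abstract describes an iterative loop in which a VLM proposes tool designs and action plans while evolutionary search optimizes both against a simulated task reward. The sketch below is a minimal, hypothetical illustration of such a co-design loop: the `propose_candidate`, `mutate`, and `fitness` stubs are placeholders (not the authors' actual API), standing in for VLM-generated designs and physics-simulation evaluation.

```python
import random

# Hypothetical sketch of a tool/policy co-design loop. All names are
# illustrative; in the actual system, a VLM would propose and refine
# candidates and fitness would come from a physics simulator.

def propose_candidate():
    # Stand-in for a VLM generating tool geometry plus a control plan.
    return {"tool": [random.uniform(0.0, 1.0) for _ in range(3)],
            "plan": [random.choice(["push", "pull", "lift"]) for _ in range(2)]}

def mutate(candidate):
    # Stand-in for VLM-guided refinement of a promising candidate.
    return {"tool": [g + random.gauss(0.0, 0.05) for g in candidate["tool"]],
            "plan": list(candidate["plan"])}

def fitness(candidate):
    # Stand-in for task reward measured in simulation (toy objective:
    # geometry parameters close to 0.5 score higher).
    return -sum((g - 0.5) ** 2 for g in candidate["tool"])

def co_design(generations=10, population=8, elite=2):
    pool = [propose_candidate() for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=fitness, reverse=True)
        parents = pool[:elite]
        # Refill the pool: keep elites, mutate them, and add one fresh
        # proposal to maintain exploration.
        pool = parents + [mutate(random.choice(parents))
                          for _ in range(population - elite - 1)]
        pool.append(propose_candidate())
    return max(pool, key=fitness)

best = co_design()
```

The key design choice this toy loop mirrors is that the tool (morphology) and the plan (control) are optimized jointly as one candidate, rather than fixing one and searching over the other.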