ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

📅 2025-10-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Frontend code correctness depends critically on visual rendering and interactive behavior, modalities that current large language models (LLMs) struggle to model at the pixel level, limiting generation quality. To address this, we propose a vision-aware reinforcement learning framework that decouples training and inference. During training, a multimodal LLM serves as a visual critic, enabling screenshot-based closed-loop optimization, with a strict zero-reward rule for code that fails to render and a Forced Optimization rule that admits only improving revisions. At inference, a lightweight, critic-free self-edit cycle enhances real-time correction capability. Our approach achieves significant improvements over strong baselines across three mainstream frontend benchmarks. It is the first work to empirically validate both the effectiveness and scalability of vision-derived reward signals and agent-style perceptual feedback for frontend code generation, demonstrating that grounding code synthesis in visual execution semantics substantially improves functional correctness and user-aligned behavior.
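The zero-reward rule described above can be sketched as a simple guard in the reward computation: a visual critic scores a screenshot of the rendered page, but any code that fails to render receives exactly zero reward. This is a minimal illustration, not the paper's implementation; `render_screenshot` and `mllm_critic_score` are hypothetical placeholder callables.

```python
def compute_reward(code: str, render_screenshot, mllm_critic_score) -> float:
    """Sketch of a screenshot-based reward with a strict zero-reward rule.

    render_screenshot(code) -> screenshot object, or None if rendering fails.
    mllm_critic_score(screenshot) -> visual-critic score (e.g., in [0, 1]).
    Both callables are hypothetical stand-ins for the paper's components.
    """
    screenshot = render_screenshot(code)
    if screenshot is None:
        # Strict zero-reward for invalid renders: anchors renderability
        # and removes the incentive to game the critic with broken output.
        return 0.0
    return mllm_critic_score(screenshot)
```

The guard runs before the critic is ever consulted, so unrenderable code can never earn partial credit from a lenient visual judge.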

Technology Category

Application Category

📝 Abstract
While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate-diagnose-refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic (scoring code from screenshots) and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.
Problem

Research questions and friction points this paper is trying to address.

Addresses vision-grounded front-end code generation challenges
Enables agentic generate-diagnose-refine loop with multimodal feedback
Improves rendered pixel correctness through visual critic integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-grounded RL with multimodal LLM critic
Forced Optimization ensures monotonically improving revisions
Lightweight critic-free self-edit cycle at inference
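The Forced Optimization acceptance rule can be sketched as a revise-and-accept loop that keeps a candidate revision only when it strictly improves the critic score, so the kept trajectory is monotonically improving. This is an illustrative sketch under stated assumptions; `score` and `revise` are hypothetical callables, not the paper's API.

```python
def forced_optimization(initial_code: str, score, revise, max_steps: int = 5):
    """Sketch of a strict acceptance rule: admit only improving revisions.

    score(code) -> scalar quality signal (e.g., a visual-critic reward).
    revise(code) -> a proposed revision of the code.
    Both are hypothetical placeholders for the paper's components.
    """
    best_code, best_score = initial_code, score(initial_code)
    for _ in range(max_steps):
        candidate = revise(best_code)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            # Strict acceptance: non-improving revisions are discarded,
            # so the retained trajectory never regresses.
            best_code, best_score = candidate, candidate_score
    return best_code, best_score
```

Because rejected revisions are discarded rather than kept, the sequence of accepted states is monotone in the score, which is what prevents behavioral collapse during training.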
Yuhang Li
Yale University
Machine Learning
Chenchen Zhang
LLM Department, Tencent
Ruilin Lv
Independent Researcher
Ao Liu
LLM Department, Tencent
Ken Deng
Kwaipilot Team, Kuaishou Technology
LLM, AI4SE, AI Agent
Yuanxing Zhang
Kuaishou Technology
Recommender System, Large Language Model, Video Understanding
Jiaheng Liu
Nanjing University
Wiggin Zhou
LLM Department, Tencent
Bo Zhou
LLM Department, Tencent