ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

📅 2025-10-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Frontend code correctness depends critically on visual rendering and interactive behavior, modalities that current large language models (LLMs) struggle to model at the pixel level, limiting generation quality. To address this, we propose a vision-aware reinforcement learning framework that decouples training and inference. During training, a multimodal LLM serves as a visual critic, enabling screenshot-based closed-loop optimization, with a strict zero-reward rule for code that fails to render and a Forced Optimization rule that admits only improving revisions. At inference, a lightweight, critic-free self-edit cycle enhances real-time correction capability. Our approach achieves significant improvements over strong baselines across three mainstream frontend benchmarks. It is the first work to empirically validate both the effectiveness and scalability of vision-derived reward signals and agent-style perceptual feedback for frontend code generation, demonstrating that grounding code synthesis in visual execution semantics substantially improves functional correctness and user-aligned behavior.
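The zero-reward rule described above can be sketched as a simple guard in the reward computation: a visual critic scores a screenshot of the rendered page, but any code that fails to render receives exactly zero reward. This is a minimal illustration, not the paper's implementation; `render_screenshot` and `mllm_critic_score` are hypothetical placeholder callables.

```python
def compute_reward(code: str, render_screenshot, mllm_critic_score) -> float:
    """Sketch of a screenshot-based reward with a strict zero-reward rule.

    render_screenshot(code) -> screenshot object, or None if rendering fails.
    mllm_critic_score(screenshot) -> visual-critic score (e.g., in [0, 1]).
    Both callables are hypothetical stand-ins for the paper's components.
    """
    screenshot = render_screenshot(code)
    if screenshot is None:
        # Strict zero-reward for invalid renders: anchors renderability
        # and removes the incentive to game the critic with broken output.
        return 0.0
    return mllm_critic_score(screenshot)
```

The guard runs before the critic is ever consulted, so unrenderable code can never earn partial credit from a lenient visual judge.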

Technology Category

Application Category

📝 Abstract
While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate-diagnose-refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent uses the MLLM-in-the-loop both as a visual critic (scoring code from screenshots) and as a source of actionable, vision-grounded feedback; a strict zero-reward rule for invalid renders anchors renderability and prevents reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training-inference decoupling.
Problem

Research questions and friction points this paper is trying to address.

Addresses vision-grounded front-end code generation challenges
Enables agentic generate-diagnose-refine loop with multimodal feedback
Improves rendered pixel correctness through visual critic integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-grounded RL with multimodal LLM critic
Forced Optimization ensures monotonically improving revisions
Lightweight critic-free self-edit cycle at inference
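The Forced Optimization acceptance rule can be sketched as a revise-and-accept loop that keeps a candidate revision only when it strictly improves the critic score, so the kept trajectory is monotonically improving. This is an illustrative sketch under stated assumptions; `score` and `revise` are hypothetical callables, not the paper's API.

```python
def forced_optimization(initial_code: str, score, revise, max_steps: int = 5):
    """Sketch of a strict acceptance rule: admit only improving revisions.

    score(code) -> scalar quality signal (e.g., a visual-critic reward).
    revise(code) -> a proposed revision of the code.
    Both are hypothetical placeholders for the paper's components.
    """
    best_code, best_score = initial_code, score(initial_code)
    for _ in range(max_steps):
        candidate = revise(best_code)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            # Strict acceptance: non-improving revisions are discarded,
            # so the retained trajectory never regresses.
            best_code, best_score = candidate, candidate_score
    return best_code, best_score
```

Because rejected revisions are discarded rather than kept, the sequence of accepted states is monotone in the score, which is what prevents behavioral collapse during training.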
Yuhang Li
Yale University
Machine Learning
Chenchen Zhang
LLM Department, Tencent
Ruilin Lv
Independent Researcher
Ao Liu
LLM Department, Tencent
Ken Deng
Kwaipilot Team, Kuaishou Technology
LLM, AI4SE, AI Agent
Yuanxing Zhang
Kuaishou Technology
Recommender System, Large Language Model, Video Understanding
Jiaheng Liu
Nanjing University
Wiggin Zhou
LLM Department, Tencent
Bo Zhou
LLM Department, Tencent