🤖 AI Summary
Current AI systems face significant challenges in autonomously generating interactive audiovisual content (e.g., web-based games): while large language models (LLMs) can produce JavaScript code, they lack automated evaluation mechanisms and struggle to incorporate custom assets or multimodal feedback for iterative quality improvement. To address this, we propose AVR-Agent, a multi-agent framework that integrates cross-modal foundation models with audiovisual recording-based feedback loops, enabling end-to-end content generation, automated comparative evaluation, and closed-loop optimization. We further introduce AVR-Eval, the first metric enabling relative performance assessment grounded in perceptual output rather than symbolic correctness. Experiments demonstrate that AVR-Agent significantly outperforms single-shot generation baselines under pairwise evaluation, while exposing critical bottlenecks in current models' ability to integrate domain-specific assets and audiovisual feedback. This work establishes a novel paradigm and a reproducible evaluation framework for autonomous interactive multimedia content generation.
📝 Abstract
While AI excels at generating text, audio, images, and videos, creating interactive audio-visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but they lack automated evaluation metrics and struggle with complex content that normally requires teams of humans working for many months with artist-made assets, a setting that calls for multi-shot, multi-agent generation. To tackle these issues, we built a new metric and a multi-agent system.
We propose AVR-Eval, a relative metric for multimedia content quality based on Audio-Visual Recordings (AVRs). An omni-modal model (processing text, video, and audio) compares the AVRs of two pieces of content, and a text model reviews these evaluations to determine which is superior. We show that AVR-Eval reliably distinguishes good content from broken or mismatched content.
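To make the pairwise protocol concrete, here is a minimal sketch of how such a relative metric could be aggregated into a win rate. The function names (`judge_fn` for the omni-modal comparison, `review_fn` for the text-model verdict) and the order-swapping scheme are illustrative assumptions, not the paper's exact implementation:

```python
def avr_eval(avr_a, avr_b, judge_fn, review_fn, n_rounds=4):
    """Return the win rate of content A over content B.

    judge_fn(first, second) -> free-form critique from an omni-modal model
    review_fn(critique)     -> "first", "second", or "tie" (text model verdict)

    Presentation order is alternated across rounds to reduce positional bias.
    """
    wins = 0.0
    for i in range(n_rounds):
        a_is_first = (i % 2 == 0)
        first, second = (avr_a, avr_b) if a_is_first else (avr_b, avr_a)
        critique = judge_fn(first, second)   # omni-modal comparison of two AVRs
        verdict = review_fn(critique)        # text model extracts a verdict
        if verdict == "tie":
            wins += 0.5
        elif (verdict == "first") == a_is_first:
            wins += 1.0                      # the round was won by content A
    return wins / n_rounds
```

A judge that always favors the first-presented recording would score 0.5 under this scheme, which is why averaging over both presentation orders matters for a relative metric.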
We built AVR-Agent, a multi-agent system that generates JavaScript code from a bank of multimedia assets (audio, images, 3D models). The coding agent selects relevant assets, generates multiple initial code candidates, uses AVR-Eval to identify the best version, and iteratively improves it using omni-modal agent feedback derived from the AVR.
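The generate-select-refine loop described above can be sketched as follows. This is an illustrative outline under assumed interfaces (`generate`, `record`, `better_than`, `refine` are hypothetical callables standing in for the coding agent, the AVR capture step, the AVR-Eval comparator, and the feedback-driven rewrite), not the system's actual API:

```python
def avr_agent(generate, record, better_than, refine,
              n_candidates=3, n_iters=5):
    """Best-of-n initial generation followed by closed-loop refinement.

    generate()            -> new candidate code (coding agent, asset-aware)
    record(code)          -> audio-visual recording (AVR) of the running code
    better_than(a, b)     -> True if AVR a beats AVR b (AVR-Eval comparator)
    refine(code, avr)     -> revised code given omni-modal feedback on its AVR
    """
    # Generate several initial candidates and keep the pairwise winner.
    best = generate()
    for _ in range(n_candidates - 1):
        cand = generate()
        if better_than(record(cand), record(best)):
            best = cand

    # Closed-loop optimization: only accept revisions that win against
    # the current best under the recording-based comparison.
    for _ in range(n_iters):
        best_avr = record(best)
        cand = refine(best, best_avr)
        if better_than(record(cand), best_avr):
            best = cand
    return best
```

The key design choice mirrored here is that the same relative metric drives both the initial selection and the accept/reject decision during refinement, so the loop can never regress below its current best candidate.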
We run experiments on games and animations, using AVR-Eval to measure the win rate of content A against content B. We find that content generated by AVR-Agent has a significantly higher win rate than content made through one-shot generation. However, models struggle to leverage custom assets and AVR feedback effectively, showing no higher win rate when these are provided. This reveals a critical gap: while humans benefit from high-quality assets and audio-visual feedback, current coding models do not seem to utilize these resources as effectively, highlighting fundamental differences between human and machine approaches to content creation.