Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

📅 2025-08-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current AI systems face significant challenges in autonomously generating interactive audiovisual content (e.g., web-based games): while large language models (LLMs) can produce JavaScript code, they lack automated evaluation mechanisms and struggle to incorporate custom assets or multimodal feedback for iterative quality improvement. To address this, we propose AVR-Agent, a multi-agent framework integrating cross-modal foundation models with audiovisual recording-based feedback loops, enabling end-to-end content generation, automated comparative evaluation, and closed-loop optimization. We further introduce AVR-Eval, the first metric enabling relative performance assessment grounded in perceptual output rather than symbolic correctness. Experiments demonstrate that AVR-Agent significantly outperforms single-shot generation baselines under adversarial evaluation, exposing critical bottlenecks in current models' ability to integrate domain-specific assets and audiovisual feedback. This work establishes a novel paradigm and reproducible evaluation framework for autonomous interactive multimedia content generation.

πŸ“ Abstract
While AI excels at generating text, audio, images, and videos, creating interactive audio-visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but they lack automated evaluation metrics and struggle with complex content that normally requires teams of humans working for many months (multi-shot, multi-agent) using assets made by artists. To tackle these issues, we built a new metric and a multi-agent system. We propose AVR-Eval, a relative metric for multimedia content quality based on Audio-Visual Recordings (AVRs). An omni-modal model (processing text, video, and audio) compares the AVRs of two pieces of content, and a text model reviews the evaluations to determine which is superior. We show that AVR-Eval reliably distinguishes good content from broken or mismatched content. We built AVR-Agent, a multi-agent system that generates JavaScript code from a bank of multimedia assets (audio, images, 3D models). The coding agent selects relevant assets, generates multiple initial versions of the code, uses AVR-Eval to identify the best one, and iteratively improves it through omni-modal agent feedback derived from the AVR. We run experiments on games and animations, using AVR-Eval to measure the win rate of content A against content B. We find that content generated by AVR-Agent has a significantly higher win rate than content produced through one-shot generation. However, models struggle to leverage custom assets and AVR feedback effectively, showing no higher win rate when these are provided. This reveals a critical gap: while humans benefit from high-quality assets and audio-visual feedback, current coding models do not seem to utilize these resources as effectively, highlighting fundamental differences between human and machine approaches to content creation.
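The abstract's generate-evaluate-improve loop can be sketched as plain Python. Everything below is a hypothetical illustration, not the paper's implementation: `generate_code`, `record_avr`, `avr_eval_wins`, and `critique` are deterministic stubs standing in for the coding agent, the AVR capture step, the omni-modal judge, and the omni-modal feedback agent.

```python
# Hypothetical stubs for the components described in the abstract.
def generate_code(assets, seed):
    # Stand-in for the LLM coding agent producing a JavaScript game.
    return f"// game using {sorted(assets)} (seed {seed})"

def record_avr(code):
    # Stand-in for running the game and capturing an Audio-Visual Recording.
    return {"video": f"video of {code}", "audio": f"audio of {code}"}

def avr_eval_wins(avr_a, avr_b):
    # Stand-in for the omni-modal judge: does recording A beat recording B?
    return len(avr_a["video"]) >= len(avr_b["video"])

def critique(avr):
    # Stand-in for omni-modal feedback on the recording.
    return "fix: audio out of sync with jump animation"

def avr_agent(assets, n_initial=3, n_rounds=2):
    # 1. Generate several initial candidates and keep the AVR-Eval winner.
    candidates = [generate_code(assets, seed=s) for s in range(n_initial)]
    best = candidates[0]
    for cand in candidates[1:]:
        if avr_eval_wins(record_avr(cand), record_avr(best)):
            best = cand
    # 2. Iteratively revise the winner using omni-modal feedback,
    #    keeping a revision only if AVR-Eval prefers it over the incumbent.
    for _ in range(n_rounds):
        feedback = critique(record_avr(best))
        revised = best + f"\n// revised per feedback: {feedback}"
        if avr_eval_wins(record_avr(revised), record_avr(best)):
            best = revised
    return best
```

In the real system each stub would be a foundation-model call or a headless-browser recording step; the control flow (best-of-n selection followed by feedback-driven revision accepted only when the judge prefers it) is the part the sketch aims to show.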
Problem

Research questions and friction points this paper is trying to address.

Automated evaluation of interactive audio-visual content quality
Multi-agent generation of complex multimedia JavaScript games
Effective utilization of custom assets and audio-visual feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

AVR-Eval metric for multimedia content quality
Multi-agent system generating JavaScript code
Omni-modal model for text, video, audio processing
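As a concrete reading of "relative metric," a win rate between two recordings could be aggregated from repeated judge verdicts, with the presentation order swapped across matches to reduce position bias. This is a hypothetical sketch, not the paper's protocol: `omni_judge` is a stub comparing a made-up `quality` field, standing in for the omni-modal model's verdict.

```python
def omni_judge(avr_a, avr_b):
    # Stub for the omni-modal model's verdict: True if A looks/sounds better.
    return avr_a["quality"] > avr_b["quality"]

def win_rate(avr_a, avr_b, n_matches=4):
    # Aggregate A-beats-B verdicts over several matches, swapping the
    # presentation order on alternating matches to reduce position bias.
    wins = 0
    for i in range(n_matches):
        if i % 2 == 0:
            wins += omni_judge(avr_a, avr_b)
        else:
            wins += not omni_judge(avr_b, avr_a)
    return wins / n_matches
```

With the stub judge, a clearly better recording wins every match (`win_rate` of 1.0) and a clearly worse one loses every match (0.0); a real judge would produce intermediate rates.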