MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of large-scale, high-quality benchmarks for automated generation of Model Cards and Data Cards in generative AI, a gap that hinders transparency and governance. It introduces MetaGAI, a benchmark of 2,541 verified document triplets aligned through semantic triangulation across academic papers, GitHub repositories, and Hugging Face artifacts. The benchmark is built with a multi-agent framework of specialized Retriever, Generator, and Editor agents and validated through four-dimensional human-in-the-loop assessment. Its evaluation protocol combines automated metrics with validated LLM-as-a-Judge frameworks. Experiments show that sparse Mixture-of-Experts (MoE) architectures achieve superior cost-quality efficiency and reveal a fundamental trade-off between faithfulness and completeness. MetaGAI offers a reproducible, scalable foundation for benchmarking, training, and analyzing automated documentation generation.

📝 Abstract

The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automated approaches lack large-scale, high-fidelity benchmarks for systematic evaluation. We introduce MetaGAI, a comprehensive benchmark comprising 2,541 verified document triplets constructed through semantic triangulation of academic papers, GitHub repositories, and Hugging Face artifacts. Unlike prior single-source datasets, MetaGAI employs a multi-agent framework with specialized Retriever, Generator, and Editor agents, validated through four-dimensional human-in-the-loop assessment, including human evaluation of editor-refined ground truth. We establish a robust evaluation protocol combining automated metrics with validated LLM-as-a-Judge frameworks. Extensive analysis reveals that sparse Mixture-of-Experts architectures achieve superior cost-quality efficiency, while a fundamental trade-off exists between faithfulness and completeness. MetaGAI provides a foundational testbed for benchmarking, training, and analyzing automated Model and Data Card generation methods at scale. Our data and code are available at: https://github.com/haoxuan-unt2024/MetaGAI-Benchmark.
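The faithfulness-completeness trade-off mentioned in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's evaluation protocol: it assumes a card and its sources can each be reduced to a set of atomic claims, and it treats faithfulness as a precision-like score and completeness as a recall-like score. The function names and the sample facts are hypothetical.

```python
from typing import Set


def faithfulness(card_claims: Set[str], source_facts: Set[str]) -> float:
    """Toy precision: share of generated claims supported by the sources."""
    if not card_claims:
        return 1.0  # an empty card makes no unsupported claims
    return len(card_claims & source_facts) / len(card_claims)


def completeness(card_claims: Set[str], source_facts: Set[str]) -> float:
    """Toy recall: share of source facts that the generated card covers."""
    if not source_facts:
        return 1.0  # nothing to cover
    return len(card_claims & source_facts) / len(source_facts)


# Hypothetical facts recoverable from a paper, a repository, and a hub page.
source_facts = {"license: MIT", "params: 7B", "trained on: web text", "languages: en"}

# A cautious card states only two claims, both supported.
cautious = {"license: MIT", "params: 7B"}

# An exhaustive card covers every fact but adds one unsupported claim.
exhaustive = source_facts | {"context length: 32k"}

for name, card in [("cautious", cautious), ("exhaustive", exhaustive)]:
    print(name, faithfulness(card, source_facts), completeness(card, source_facts))
```

Under these assumptions the cautious card scores higher on faithfulness and lower on completeness, while the exhaustive card shows the reverse, which is the shape of the trade-off the abstract describes.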

Problem

Research questions and friction points this paper is trying to address.

Generative AI

Model Cards

Data Cards

Benchmark

Automated Documentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Model Cards

Data Cards

Multi-agent Framework

Semantic Triangulation

Mixture-of-Experts

🔎 Similar Papers

A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

2024-04-20 · arXiv.org · Citations: 2