AdTEC: A Unified Benchmark for Evaluating Text Quality in Search Engine Advertising

📅 2024-08-12

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

205K/year

🤖 AI Summary

Existing advertising text quality evaluation lacks multidimensional, realistic, and Japanese-language benchmarks. Method: We propose AdTEC—the first Japanese multidimensional advertising copy evaluation benchmark tailored for search advertising. Grounded in operational expertise from leading Japanese advertising agencies, AdTEC covers five business-critical dimensions: relevance, appeal, compliance, creativity, and readability, supported by a high-quality human-annotated dataset. We formally define the multidimensional advertising copy evaluation task, integrating pretrained language models (PLMs) for automated assessment, expert annotation, and comparative analysis, while introducing operation-oriented task modeling and data construction techniques. Results: Experiments show that state-of-the-art PLMs achieve near-human performance on foundational dimensions (e.g., relevance) but exhibit substantial gaps in creativity and compliance judgment. AdTEC establishes the first open-source, reproducible, and extensible industrial-grade evaluation standard for advertising generation systems.

Technology Category

Application Category

📝 Abstract

With the increase in the fluency of ad texts automatically created by natural language generation technology, there is high demand to verify the quality of these creatives in a real-world setting. We propose AdTEC (Ad Text Evaluation Benchmark by CyberAgent), the first public benchmark to evaluate ad texts from multiple perspectives within practical advertising operations. Our contributions are as follows: (i) Defining five tasks for evaluating the quality of ad texts, as well as building a Japanese dataset based on the practical operational experiences of building a Japanese dataset based on the practical operational experiences of advertising agencies, which are typically kept in-house. (ii) Validating the performance of existing pre-trained language models (PLMs) and human evaluators on the dataset. (iii) Analyzing the characteristics and providing challenges of the benchmark. The results show that while PLMs have already reached practical usage level in several tasks, humans still outperform in certain domains, implying that there is significant room for improvement in this area.

Problem

Research questions and friction points this paper is trying to address.

Evaluating quality of auto-generated ad texts in real-world settings

Creating first public benchmark for multi-perspective ad text assessment

Comparing performance of pre-trained models vs humans on ad tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

First public benchmark for ad text evaluation

Uses Japanese dataset from advertising agencies

Tests pre-trained models and human evaluators

🔎 Similar Papers

No similar papers found.