π€ AI Summary
Existing advertising text quality evaluation lacks multidimensional, realistic, and Japanese-language benchmarks. Method: We propose AdTECβthe first Japanese multidimensional advertising copy evaluation benchmark tailored for search advertising. Grounded in operational expertise from leading Japanese advertising agencies, AdTEC covers five business-critical dimensions: relevance, appeal, compliance, creativity, and readability, supported by a high-quality human-annotated dataset. We formally define the multidimensional advertising copy evaluation task, integrating pretrained language models (PLMs) for automated assessment, expert annotation, and comparative analysis, while introducing operation-oriented task modeling and data construction techniques. Results: Experiments show that state-of-the-art PLMs achieve near-human performance on foundational dimensions (e.g., relevance) but exhibit substantial gaps in creativity and compliance judgment. AdTEC establishes the first open-source, reproducible, and extensible industrial-grade evaluation standard for advertising generation systems.
π Abstract
With the increase in the fluency of ad texts automatically created by natural language generation technology, there is high demand to verify the quality of these creatives in a real-world setting. We propose AdTEC (Ad Text Evaluation Benchmark by CyberAgent), the first public benchmark to evaluate ad texts from multiple perspectives within practical advertising operations. Our contributions are as follows: (i) Defining five tasks for evaluating the quality of ad texts, as well as building a Japanese dataset based on the practical operational experiences of building a Japanese dataset based on the practical operational experiences of advertising agencies, which are typically kept in-house. (ii) Validating the performance of existing pre-trained language models (PLMs) and human evaluators on the dataset. (iii) Analyzing the characteristics and providing challenges of the benchmark. The results show that while PLMs have already reached practical usage level in several tasks, humans still outperform in certain domains, implying that there is significant room for improvement in this area.