🤖 AI Summary
This study addresses the difficulty large language models (LLMs) have in generating satire, a context-dependent form of humor that must balance political relevance with comedic effect. The work presents a novel pipeline that applies retrieval-augmented generation (RAG) over recent news articles to produce satirical dictionary definitions in the Finnish context. It also proposes a task-specific human evaluation framework whose experimental conditions include cultural background, source-word type, and the presence or absence of RAG. The annotated dataset and code are released publicly, and LLM-as-a-judge is used for automated assessment. The results indicate that RAG improves political relevance but yields no clear gain in perceived humor. LLM-based evaluations correlate well with human judgments on political relevance but poorly on humor.
📝 Abstract
Humor generation remains a challenging task for Large Language Models (LLMs) because humor is inherently subjective. We focus on satire, a form of humor strongly shaped by context. In this work, we present a novel pipeline for grounded satire generation that uses Retrieval-Augmented Generation (RAG) over current news to produce satirical dictionary definitions in the Finnish context. We also introduce a new task-specific evaluation framework and annotate 100 generated definitions with six human annotators, enabling analysis across multiple experimental conditions, including cultural background, source-word type, and the presence or absence of RAG. Our results show that the generated definitions are perceived as more political than humorous. Both topic-based word selection and RAG improve the political relevance of the outputs, but neither yields clear gains in humor. In addition, our LLM-as-a-judge evaluation of five state-of-the-art models indicates that LLM ratings correlate well with human judgments on political relevance but poorly on humor. We release our code and annotated dataset to support further research on grounded satire generation and evaluation.