🤖 AI Summary
This work addresses the challenge that large language models (LLMs) struggle with cartoon visual question answering (VQA) because of exaggerated visual styles and complex narrative contexts. To this end, we propose the first multi-agent LLM framework specifically designed for cartoon VQA, comprising three collaborative agents: a visual agent for feature extraction, a language agent for semantic understanding, and a critic agent for reasoning correction. This architecture enables structured cross-modal reasoning by explicitly delineating each agent's role in the multimodal inference process, thereby advancing the understanding of how LLMs reason over non-photorealistic imagery. Experiments on the Pororo and Simpsons datasets demonstrate that our framework significantly improves answer accuracy, and ablation studies confirm the contribution of each agent.
📝 Abstract
Visual Question Answering (VQA) for stylised cartoon imagery presents challenges, such as interpreting exaggerated visual abstraction and narrative-driven context, that are not adequately addressed by standard large language models (LLMs) trained on natural images. To investigate this issue, we introduce a multi-agent LLM framework specifically designed for VQA over cartoon imagery. The proposed architecture consists of three specialised agents: a visual agent, a language agent, and a critic agent, which collaborate to support structured reasoning by integrating visual cues with narrative context. The framework was systematically evaluated on two cartoon-based VQA datasets, Pororo and Simpsons. Experimental results provide a detailed analysis of how each agent contributes to the final prediction, offering a deeper understanding of LLM-based multi-agent behaviour in cartoon VQA and multimodal inference.
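The three-agent collaboration described above can be sketched as a simple pipeline. The following is a minimal, hypothetical illustration only: the class names, interfaces, and the stubbed reasoning logic are assumptions for exposition, not the paper's actual implementation, in which each agent would wrap an LLM or vision-language model call.

```python
# Hypothetical sketch of the visual -> language -> critic pipeline.
# All agent logic here is stubbed; a real system would replace each
# method body with a model call.
from dataclasses import dataclass

@dataclass
class VisualAgent:
    """Extracts visual cues from a cartoon frame (stubbed)."""
    def extract(self, image_id: str) -> dict:
        # Real system: run a vision backbone over the frame.
        return {"image": image_id,
                "objects": ["Pororo", "sled"],
                "scene": "snowfield"}

@dataclass
class LanguageAgent:
    """Combines the question with visual cues to draft an answer."""
    def answer(self, question: str, cues: dict) -> str:
        # Real system: prompt an LLM with the question and cues.
        if "where" in question.lower():
            return cues["scene"]
        return cues["objects"][0]

@dataclass
class CriticAgent:
    """Checks the draft answer against the cues and corrects it."""
    def review(self, question: str, cues: dict, draft: str) -> str:
        # Real system: a verification pass that grounds the answer.
        grounded = draft in cues["objects"] or draft == cues["scene"]
        return draft if grounded else cues["objects"][0]

def cartoon_vqa(image_id: str, question: str) -> str:
    """Run the three agents in sequence and return the final answer."""
    cues = VisualAgent().extract(image_id)
    draft = LanguageAgent().answer(question, cues)
    return CriticAgent().review(question, cues, draft)
```

For example, `cartoon_vqa("ep01_f12", "Where is Pororo?")` returns the scene-level cue, with the critic agent passing the draft through only because it is grounded in the extracted cues; an ungrounded draft would be replaced rather than emitted.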