🤖 AI Summary
This work investigates the capability of multimodal autonomous agents in real-world news writing, specifically their ability to bridge information gaps and generate structured journalistic narratives. Method: We introduce the first multimodal agent benchmark tailored for news writing, requiring agents to autonomously perform web navigation, cross-source multimodal information retrieval, factual filtering, and narrative integration—unifying webpage-level multimodal exploration with narrative planning to emulate journalists’ active information-gap filling. Our approach integrates LLMs with state-of-the-art agent frameworks, leveraging keyword-based retrieval, historical context grounding, and multi-step reasoning for end-to-end news generation. Contribution/Results: Experiments reveal that while current agents excel at factual retrieval, they exhibit significant bottlenecks in task decomposition, long-horizon planning, and narrative coherence. The benchmark provides a quantifiable evaluation standard and concrete optimization directions for advancing multimodal autonomous agents.
📝 Abstract
Recent advances in autonomous digital agents from industry (e.g., Manus AI and Gemini's research mode) highlight their potential for structured tasks through autonomous decision-making and task decomposition; however, it remains unclear to what extent such agent-based systems can improve multimodal web data productivity. We study this question in journalism, which requires iterative planning, interpretation, and contextual reasoning over raw multimodal content to produce a well-structured news article. We introduce NEWSAGENT, a benchmark for evaluating how agents automatically search available raw content, select relevant information, and edit and rephrase it into a news article, exercising core journalistic functions. Given a writing instruction and firsthand data, as a journalist would receive when initiating a draft, agents are tasked to identify narrative perspectives, issue keyword-based queries, retrieve historical background, and generate complete articles. Unlike typical summarization or retrieval tasks, essential context is not directly available and must be actively discovered, reflecting the information gaps faced in real-world news writing. NEWSAGENT includes 6k human-verified examples derived from real news, with multimodal content converted to text for broad model compatibility. We evaluate open- and closed-source LLMs with commonly used agentic frameworks on NEWSAGENT, finding that agents are capable of retrieving relevant facts but struggle with planning and narrative integration. We believe NEWSAGENT serves as a realistic testbed for iterating on and evaluating agent capabilities in turning multimodal web data into real-world productivity.
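The workflow the abstract describes (identify narrative perspectives, issue keyword queries, retrieve historical background, then draft the article) can be sketched as a minimal pipeline. This is an illustrative toy, not the benchmark's actual API: the function names, the `NewsTask` structure, and the in-memory `ARCHIVE` standing in for web retrieval are all assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class NewsTask:
    instruction: str   # editor's writing instruction
    firsthand: str     # firsthand material (multimodal content rendered as text)

# Toy "historical archive" standing in for retrieval over past coverage.
ARCHIVE = {
    "flood": "In 2021 the river breached its banks, displacing 2,000 residents.",
    "levee": "The levee upgrade project was approved in 2023 but stalled on funding.",
}

def propose_keywords(task: NewsTask) -> list[str]:
    # Stand-in for the LLM step that identifies narrative perspectives
    # and turns them into keyword-based queries.
    return [w.strip(".,").lower() for w in task.firsthand.split()
            if w.strip(".,").lower() in ARCHIVE]

def retrieve_background(keywords: list[str]) -> list[str]:
    # Stand-in for keyword-based retrieval of historical context.
    return [ARCHIVE[k] for k in keywords]

def write_article(task: NewsTask, background: list[str]) -> str:
    # Stand-in for narrative integration: a real agent would plan the
    # article structure and weave firsthand facts with retrieved context.
    return " ".join([f"[{task.instruction}]", task.firsthand, *background])

task = NewsTask(
    instruction="Report on today's flood, with context on the levee project",
    firsthand="Heavy rain caused a flood near the unfinished levee today.",
)
kws = propose_keywords(task)
article = write_article(task, retrieve_background(kws))
print(article)
```

The point of the sketch is the information-gap structure: the firsthand material alone is insufficient, and the agent must actively query for the background that makes the article coherent.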