MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

📅 2026-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for terminal agents focus mainly on tasks grounded in text, code, and structured files, and do not evaluate work on audio and video files. This paper introduces MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories in which terminal agents operate directly on audio and video files. The authors also propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception. Together, the benchmark and the harness support a controlled study of how different forms of multimedia access shape task outcomes and which evidence agents rely on when constructing executable terminal workflows. The MMTB media and metadata are released on Hugging Face.
📝 Abstract
Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. However, many real-world workflows require practitioners to work directly with audio and video files. Working with such multimedia files calls for terminal agents not only to understand multimedia content, but also to convert auditory and visual evidence across related files into appropriate actions. To evaluate terminal agents on multimedia-file tasks, we introduce MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories where terminal agents directly operate with audio and video files. Alongside MMTB, we propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception for terminal agents. Together, MMTB and Terminus-MM support a controlled study of multimedia terminal agents, revealing how different forms of multimedia access shape task outcomes and determine which evidence agents rely on to construct executable terminal workflows. MMTB media and metadata are released at https://huggingface.co/datasets/mm-tbench/mmtb-media
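To make the idea of turning auditory evidence into a terminal action more concrete, here is a minimal sketch of a toy task of that general kind. It is not taken from MMTB and does not reflect how Terminus-MM works; the task, the file names, the RMS threshold, and every function name below are hypothetical and chosen only for illustration. The sketch uses only the Python standard library: it writes two synthetic WAV files, inspects their audio content, and selects the file that actually contains sound.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def write_tone(path, seconds, amplitude, rate=8000, freq=440.0):
    """Write a mono 16-bit sine tone; an amplitude of 0 yields silence."""
    n = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(amplitude * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n)
    )
    with wave.open(str(path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(rate)
        w.writeframes(frames)


def inspect_wav(path):
    """Return (duration_seconds, rms) for a mono 16-bit WAV file."""
    with wave.open(str(path), "rb") as w:
        n = w.getnframes()
        rate = w.getframerate()
        samples = struct.unpack(f"<{n}h", w.readframes(n))
    rms = math.sqrt(sum(s * s for s in samples) / n) if n else 0.0
    return n / rate, rms


def pick_non_silent(directory, threshold=100.0):
    """Hypothetical task: list WAV files whose RMS level exceeds a threshold."""
    results = []
    for path in sorted(Path(directory).glob("*.wav")):
        duration, rms = inspect_wav(path)
        if rms > threshold:  # the threshold is an arbitrary illustrative value
            results.append((path.name, round(duration, 2)))
    return results


def demo():
    """Create one silent and one audible file, then select the audible one."""
    with tempfile.TemporaryDirectory() as tmp:
        write_tone(Path(tmp) / "a_silence.wav", 1.0, 0)
        write_tone(Path(tmp) / "b_tone.wav", 0.5, 8000)
        return pick_non_silent(tmp)


if __name__ == "__main__":
    print(demo())
```

The decision in `pick_non_silent` depends on the content of the files rather than on their names or metadata, which is the property the abstract emphasizes. A real task would likely involve richer perception than a loudness check.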
Problem

Research questions and friction points this paper is trying to address.

terminal agents
multimedia files
benchmark
audio-video tasks
AI evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimedia terminal agents
MMTB benchmark
Terminus-MM
audio-video perception
executable workflow generation