LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically investigates frontier models' multimodal in-context imitation learning in the very long-context regime (up to 1M tokens), examining whether they can acquire interactive decision-making policies from large numbers of expert demonstrations (0–512 full episodes) embedded in the context. To this end, the authors introduce a benchmark covering six tasks (tic-tac-toe, chess, Atari, grid-world navigation, crossword solving, and simulated cheetah control) that supports both textual and visual observation encodings as well as chain-of-thought prompting, and that unifies the zero-, few-, and many-shot regimes in a single evaluation. Experiments with Claude 3.5 Sonnet, the Gemini 1.5 and 2.0 models, GPT-4o, and the o1 family show that models rarely reach expert-level performance; adding more demonstrations yields only marginal gains on most tasks; a few models improve steadily on specific tasks; and the relative value of visual versus textual observations and of chain-of-thought prompting is strongly task-dependent. The benchmark is publicly released.
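As a rough illustration of the many-shot evaluation setup described above, the sketch below packs a varying number of expert episodes ahead of the current observation in a single prompt. The episode format, function names, and prompt layout are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch of many-shot prompt assembly for in-context imitation
# learning. The data classes and prompt layout are assumptions for
# illustration, not the paper's code.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    observation: str  # observation rendered as text (could instead be an image)
    action: str       # expert action taken at this step

@dataclass
class Episode:
    steps: List[Step]

def build_prompt(demos: List[Episode], current_obs: str, num_demos: int) -> str:
    """Pack `num_demos` expert episodes plus the current observation into one prompt."""
    parts = ["You are acting in an environment. Imitate the expert demonstrations."]
    for i, ep in enumerate(demos[:num_demos]):
        parts.append(f"--- Demonstration {i + 1} ---")
        for step in ep.steps:
            parts.append(f"Observation:\n{step.observation}")
            parts.append(f"Action: {step.action}")
    parts.append("--- Current episode ---")
    parts.append(f"Observation:\n{current_obs}")
    parts.append("Action:")
    return "\n".join(parts)

# Sweeping num_demos from 0 up to 512 reproduces the zero-, few-, and
# many-shot regimes; with hundreds of full episodes in context, the prompt
# can approach the million-token scale the benchmark targets.
```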

📝 Abstract
In this paper, we present a benchmark to pressure-test today's frontier models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context, from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect. Some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.
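The abstract contrasts text and image observation encodings; the minimal sketch below shows how a single observation could be rendered either way. The message-part schema and function name are generic assumptions for illustration, not the request format of any particular model API.

```python
# Illustrative sketch of the two observation encodings compared in the paper
# (text vs. image). The part schema is a generic assumption, not an actual API.
import base64
from typing import Optional

def encode_observation(obs_text: str, obs_png: Optional[bytes], as_image: bool) -> dict:
    """Return one prompt part for a single observation, either textual or visual."""
    if as_image and obs_png is not None:
        # Visual encoding: embed the rendered board or game frame as a base64 PNG.
        return {
            "type": "image",
            "mime_type": "image/png",
            "data": base64.b64encode(obs_png).decode("ascii"),
        }
    # Textual encoding: e.g. a FEN string for chess or an ASCII grid for navigation.
    return {"type": "text", "text": obs_text}
```

Per the abstract, neither encoding dominates: which one helps depends on the task.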
Problem

Research questions and friction points this paper is trying to address.

Multimodal decision-making in long-context regimes.
Learning from large numbers of expert demonstrations.
Evaluating frontier models on interactive decision-making tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal decision-making benchmark
Long-context regime testing
Expert demonstration learning evaluation