TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

📅 2024-12-07

🏛️ arXiv.org

📈 Citations: 5

✨ Influential: 0

🤖 AI Summary

Open-source vision-language models perform poorly on complex cross-modal tasks that require fine-grained perception and multi-step reasoning, such as visual grounding, OCR parsing, mathematical reasoning, and spatial reasoning. TACO is a family of multi-modal large action models built around the “Chain-of-Thought-and-Action” (CoTA) paradigm, which represents reasoning as a sequence of thoughts interleaved with executable tool calls. The approach has three parts: (1) over 1M synthetic CoTA traces are generated with GPT-4o and Python programs, then filtered and mixed down to a final subset of 293K high-quality examples; (2) at inference, the model invokes external tools such as OCR, depth estimation, and a calculator, and integrates their outputs with its thoughts; (3) training uses multi-step CoTA traces rather than instruction-tuning data with only direct answers. Across eight benchmarks, TACO improves on the instruction-tuned baseline by 3.6% on average, with gains of up to 15% on MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning.

📝 Abstract

While open-source multi-modal language models perform well on simple question answering tasks, they often fail on complex questions that require multiple capabilities, such as fine-grained recognition, visual grounding, and reasoning, and that demand multi-step solutions. We present TACO, a family of multi-modal large action models designed to improve performance on such complex, multi-step, and multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, then integrates both the thoughts and action outputs to produce coherent responses. To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with GPT-4o and Python programs. We then experiment with various data filtering and mixing techniques and obtain a final subset of 293K high-quality CoTA examples. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction tuning data with only direct answers. TACO outperforms the instruction-tuned baseline across 8 benchmarks, achieving a 3.6% improvement on average, with gains of up to 15% in MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning. Training on high-quality CoTA traces sets a new standard for complex multi-modal reasoning, highlighting the need for structured, multi-step instruction tuning in advancing open-source multi-modal models' capabilities.
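The inference procedure described in the abstract (produce a thought and an action, run the tool, feed the result back, repeat until an answer is given) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not taken from the paper: the `Step` format, the `terminate` action, the `run_cota` loop, the canned `fake_ocr` tool, and the `scripted_policy` that stands in for the trained model.

```python
import ast
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """Hypothetical CoTA step; the paper's actual trace schema may differ."""
    thought: str   # free-text reasoning for this step
    action: str    # tool name, or "terminate" to finish
    argument: str  # tool input, or the final answer when terminating


_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / arithmetic by walking the AST (no eval())."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))


def fake_ocr(image_id: str) -> str:
    """Stand-in for a real OCR tool; returns canned text."""
    return {"receipt.png": "coffee 3.5 bagel 2.25"}.get(image_id, "")


# Tool registry: action name -> callable taking and returning a string.
TOOLS: Dict[str, Callable[[str], str]] = {"ocr": fake_ocr, "calculate": calculator}

History = List[Tuple[Step, str]]


def run_cota(policy: Callable[[str, History], Step],
             question: str, max_steps: int = 5) -> Tuple[str, History]:
    """Alternate between policy steps and tool calls until the policy terminates."""
    history: History = []
    for _ in range(max_steps):
        step = policy(question, history)
        if step.action == "terminate":
            return step.argument, history
        observation = TOOLS[step.action](step.argument)
        history.append((step, observation))  # tool output is visible to later steps
    raise RuntimeError("no final answer within max_steps")


def scripted_policy(question: str, history: History) -> Step:
    """Hard-coded stand-in for the model: read the receipt, add, then answer."""
    if len(history) == 0:
        return Step("I need to read the receipt first.", "ocr", "receipt.png")
    if len(history) == 1:
        return Step("Add the two prices from the OCR text.", "calculate", "3.5 + 2.25")
    return Step("The calculator output is the total.", "terminate", history[-1][1])


answer, trace = run_cota(scripted_policy, "What is the receipt total?")
print(answer)
```

In a real system the scripted policy would be replaced by the model's generated output, parsed into the same step structure; the loop itself would stay the same shape.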

Problem

Research questions and friction points this paper is trying to address.

Improving vision-language models for complex perceptual-reasoning tasks

Offloading perception so that models can focus on reasoning

Enhancing models with synthesized multi-modal reasoning traces

Innovation

Methods, ideas, or system contributions that make the work stand out.

Offloads perception to vision specialists

Trains with multi-modal reasoning traces

Focuses on high-quality perceptual reasoning
