MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

📅 2026-05-25

🤖 AI Summary

Existing structured pruning methods for vision-language models struggle to preserve chain-of-thought (CoT) reasoning and often overlook the differences in activation distributions between visual and textual modalities. This work proposes MuCRASP, a framework that incorporates CoT reasoning awareness into multimodal structured pruning. MuCRASP identifies pivot tokens along generation trajectories, explicitly models cross-modal activation differences, and combines layer-wise sensitivity analysis with alignment constraints to compress the model under a global parameter budget. On Qwen2.5-VL-7B at 30% pruning, the method achieves an LLM-as-a-Judge score of 8.87 on physical reasoning tasks, compared with 7.32 for the strongest baseline, and it maintains high reasoning consistency with lower perplexity degradation up to 50% pruning.

📝 Abstract

Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.
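The abstract does not say how MuCRASP scores components or distributes the budget, so the following is only a minimal sketch of two of the ideas it names, under loudly stated assumptions. It is not the authors' method. The function names `channel_importance` and `allocate_sparsity`, the mean-absolute-activation score, and the inverse-sensitivity allocation rule are all hypothetical choices made for illustration. Pivot-token weighting is not covered here.

```python
from typing import List, Sequence


def channel_importance(
    visual_acts: Sequence[Sequence[float]],
    text_acts: Sequence[Sequence[float]],
) -> List[float]:
    """Score channels so that neither modality dominates (illustrative assumption).

    Each input is a list of token activations, one row per token and one
    column per channel. Scores are normalised within each modality before
    averaging, so a modality with larger raw activations does not swamp
    the other.
    """

    def mean_abs(acts: Sequence[Sequence[float]]) -> List[float]:
        n_channels = len(acts[0])
        return [sum(abs(row[c]) for row in acts) / len(acts) for c in range(n_channels)]

    def normalise(scores: List[float]) -> List[float]:
        total = sum(scores)
        return [s / total if total > 0 else 0.0 for s in scores]

    vis = normalise(mean_abs(visual_acts))
    txt = normalise(mean_abs(text_acts))
    return [(v + t) / 2 for v, t in zip(vis, txt)]


def allocate_sparsity(
    sensitivity: Sequence[float],
    layer_sizes: Sequence[int],
    global_sparsity: float,
    max_sparsity: float = 0.9,
) -> List[float]:
    """Assign per-layer pruning ratios under a global parameter budget.

    Hypothetical rule: a layer's ratio is proportional to the inverse of its
    sensitivity, scaled so the removed parameters sum to the global budget.
    Ratios are clipped at max_sparsity; if clipping triggers, the budget is
    undershot (this sketch does not redistribute the remainder).
    """
    inv = [1.0 / (s + 1e-8) for s in sensitivity]
    budget = global_sparsity * sum(layer_sizes)
    scale = budget / sum(i * n for i, n in zip(inv, layer_sizes))
    return [min(scale * i, max_sparsity) for i in inv]


if __name__ == "__main__":
    # Toy example: visual activations are much larger in raw magnitude.
    print(channel_importance([[10.0, 30.0]], [[1.0, 1.0]]))
    # Two equally sized layers; the second is twice as sensitive.
    print(allocate_sparsity([1.0, 2.0], [100, 100], 0.3))
```

In the toy call, the less sensitive layer receives the larger pruning ratio while the total number of removed parameters still matches the 30% global target. A real implementation would need sensitivity estimates measured on calibration data and a rule for handling clipped layers.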

Problem

Research questions and friction points this paper is trying to address.

structured pruning

chain-of-thought reasoning

vision-language models

multimodal reasoning

model compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

structured pruning

chain-of-thought reasoning

vision-language models