Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance collapse that multimodal large language models exhibit on large-scale image classification as the label space expands. The authors propose Divide-and-Conquer Inference (DCI), a training-free, plug-and-play, model-agnostic paradigm that recursively decomposes the global classification task into localized subproblems at inference time and dynamically prunes irrelevant candidates. Motivated by an information-theoretic analysis, DCI eases the conflict between attention dilution and escalating information entropy, and it avoids the quadratic cost of self-attention over a single very long prompt. The result is a higher local signal-to-noise ratio and faster inference. Without any fine-tuning, lightweight open-source models equipped with DCI reach accuracy on ImageNet-1K and ImageNet-21K that rivals or even surpasses that of state-of-the-art closed-source models.

📝 Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision-language tasks. However, when applied to large-scale image classification, their performance degrades significantly as the label space expands, a phenomenon we define as Performance Collapse in Long Sequence Recognition. Through an information-theoretic analysis, we reveal that this collapse stems from a fundamental conflict between the escalating information entropy and the prominent attention dilution and decay within attention mechanisms, which impairs the model's ability to maintain a sufficient signal-to-noise ratio when processing extremely long prompts. To mitigate this, we propose Divide-and-Conquer Inference (DCI), a novel test-time scaling strategy for visual recognition with MLLMs. DCI recursively decomposes complex global classification tasks into multiple simpler, localized subproblems and employs a dynamic pruning mechanism to compress the search space. This method effectively improves the local signal-to-noise ratio and model accuracy by mitigating the inherent weight dilution issues in long-sequence inference. Moreover, while traditional self-attention incurs a prohibitive quadratic computational complexity, DCI achieves more favorable scaling behavior and substantially accelerates inference in large-scale classification scenarios. Extensive experiments on benchmarks such as ImageNet-1K and ImageNet-21K demonstrate that DCI consistently improves classification accuracy. This enables lightweight open-source models to rival or even surpass frontier closed-source giants without any additional training or fine-tuning. As a model-agnostic, plug-and-play paradigm, DCI offers an efficient approach for scaling the inferential precision of MLLMs in large-scale scenarios.
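The recursive decomposition and pruning described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the abstract does not specify how labels are grouped or how many candidates survive each round, so the version below assumes fixed-size chunks and keeps exactly one winner per chunk. The names `dci_classify`, `choose`, `chunk_size`, and `toy_choose` are hypothetical, and `choose` stands in for a single MLLM call over a short candidate list.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for one MLLM query: given an image and a short
# list of candidate labels, return the label the model selects.
Chooser = Callable[[object, Sequence[str]], str]


def dci_classify(image: object, labels: Sequence[str],
                 choose: Chooser, chunk_size: int = 50) -> str:
    """Tournament-style divide-and-conquer over the label space.

    Assumption (not from the paper): each round splits the surviving
    candidates into chunks of at most `chunk_size`, asks the model to
    pick one label per chunk, and discards the rest.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2 so each round shrinks")

    candidates: List[str] = list(labels)
    # Keep pruning until the survivors fit into a single short prompt.
    while len(candidates) > chunk_size:
        survivors: List[str] = []
        for start in range(0, len(candidates), chunk_size):
            chunk = candidates[start:start + chunk_size]
            survivors.append(choose(image, chunk))  # one local subproblem
        candidates = survivors
    # Final decision over the remaining short list.
    return choose(image, candidates)


# Toy usage: the "image" is just its true label, and the toy chooser
# picks it whenever it appears in the chunk (otherwise the first entry).
calls: List[int] = []


def toy_choose(image: object, chunk: Sequence[str]) -> str:
    calls.append(len(chunk))  # record prompt size for inspection
    return image if image in chunk else chunk[0]


all_labels = [f"class_{i}" for i in range(1000)]
prediction = dci_classify("class_737", all_labels, toy_choose, chunk_size=10)
```

In this toy run no single query ever sees more than 10 labels, which is the point of the decomposition: each call works on a short prompt instead of one prompt containing all 1,000 labels. A real model can pick the wrong label in an early chunk and eliminate the correct class, so keeping several candidates per chunk is a plausible variation.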

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

large-scale visual recognition

performance collapse

long sequence recognition

attention dilution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Divide-and-Conquer Inference

Multimodal Large Language Models

Performance Collapse