🤖 AI Summary
This work addresses the computational redundancy in existing video person re-identification methods, which typically process every input with fixed, heavyweight multimodal models. To overcome this limitation, the authors propose IDSelect, a lightweight, reinforcement learning-based dynamic model selection agent that adaptively chooses an optimal combination of pretrained models for each input video sequence. IDSelect introduces, for the first time, an input-aware dynamic selection mechanism, combining an Actor-Critic framework, a budget-aware reward function, and entropy regularization to achieve a Pareto-optimal trade-off between accuracy and efficiency. Experimental results show that IDSelect attains a Rank-1 accuracy of 95.9% on CCVID (a 1.8% improvement) while reducing computational cost by 92.4%. On MEVID, it cuts computational overhead by 41.3% while maintaining competitive performance.
📝 Abstract
Video-based person recognition achieves robust identification by integrating face, body, and gait cues. However, current systems waste computation by processing all modalities with fixed, heavyweight ensembles regardless of input complexity. To address this limitation, we propose IDSelect, a reinforcement learning-based, cost-aware selector that chooses one pretrained model per modality, per sequence, to optimize the accuracy-efficiency trade-off. Our key insight is that an input-conditioned selector can discover complementary model choices that surpass fixed ensembles while using substantially fewer resources. IDSelect trains a lightweight agent end-to-end with actor-critic reinforcement learning under a budget-aware objective: the reward balances recognition accuracy against computational cost, while entropy regularization prevents premature convergence. At inference, the policy selects the most probable model per modality and fuses the modality-specific similarities into the final score. Extensive experiments on challenging video-based datasets demonstrate IDSelect's superior efficiency: on CCVID, it achieves 95.9% Rank-1 accuracy with 92.4% less computation than strong baselines while improving accuracy by 1.8%; on MEVID, it reduces computation by 41.3% while maintaining competitive performance.
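To make the selection mechanism concrete, here is a minimal sketch of the per-modality model choice, the budget-aware reward, and the entropy term described in the abstract. The model zoo, per-model costs, and the weights `lam` and `beta` are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical model zoo and per-model compute costs (illustrative only).
MODEL_ZOO = {
    "face": ["face_small", "face_large"],
    "body": ["body_small", "body_large"],
    "gait": ["gait_small", "gait_large"],
}
COST = {"face_small": 1.0, "face_large": 5.0,
        "body_small": 2.0, "body_large": 8.0,
        "gait_small": 1.5, "gait_large": 6.0}

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def select_models(policy_logits):
    """Greedy inference: pick the most probable model per modality."""
    return {m: MODEL_ZOO[m][int(np.argmax(l))]
            for m, l in policy_logits.items()}

def budget_aware_reward(accuracy, chosen, lam=0.05):
    """Reward trades recognition accuracy against total compute cost."""
    total_cost = sum(COST[name] for name in chosen.values())
    return accuracy - lam * total_cost

def entropy_bonus(policy_logits, beta=0.01):
    """Entropy regularizer that discourages premature policy collapse."""
    h = 0.0
    for l in policy_logits.values():
        p = softmax(np.asarray(l, dtype=float))
        h += -(p * np.log(p + 1e-12)).sum()
    return beta * h

# Example: sequence-conditioned logits (would come from the lightweight agent).
logits = {"face": np.array([0.2, 1.5]),
          "body": np.array([1.0, -0.3]),
          "gait": np.array([0.4, 0.1])}
chosen = select_models(logits)
reward = budget_aware_reward(accuracy=0.95, chosen=chosen)
```

In an actor-critic setup, `reward + entropy_bonus(...)` would drive the policy-gradient update, while the critic estimates the expected reward as a baseline; this sketch only shows the reward shaping and the greedy inference rule, not the training loop.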