🤖 AI Summary
In few-shot multimodal dialogue intent recognition for e-commerce, multi-task training induces a seesaw effect: inter-task knowledge interference caused by cumulative weight updates. To address this, we propose a collaborative learning framework that integrates large-model post-training with small-model regularized knowledge decoupling. Our approach introduces a knowledge-decoupling paradigm that separates the strong representational capacity of multimodal large language models (MLLMs) from the interpretable rule-generation capacity of lightweight models. Specifically, it unifies MLLMs, a lightweight rule distillation network, a collaborative prediction mechanism, and a few-shot adaptive fine-tuning strategy to eliminate cross-task weight conflicts and enable positive knowledge transfer. Evaluated on two real-world Taobao datasets, our method achieves online weighted F1 improvements of 6.37% and 6.28%, significantly outperforming state-of-the-art approaches.
📝 Abstract
Few-shot multimodal dialogue intention recognition is a critical challenge in the e-commerce domain. Previous methods have primarily enhanced model classification capabilities through post-training techniques. However, our analysis reveals that training for few-shot multimodal dialogue intention recognition involves two interconnected tasks, leading to a seesaw effect in multi-task learning. This phenomenon is attributed to knowledge interference stemming from the superposition of weight matrix updates during training. To address these challenges, we propose Knowledge-Decoupled Synergetic Learning (KDSL), which mitigates these issues by using smaller models to transform knowledge into interpretable rules, while applying post-training to larger models. By enabling the large and small multimodal models to collaborate on prediction, our approach achieves significant improvements. Notably, we obtain strong results on two real Taobao datasets, with gains of 6.37% and 6.28% in online weighted F1 scores over the state-of-the-art method, validating the efficacy of our framework.
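The abstract describes a collaboration in which a small model's interpretable rules and a post-trained MLLM jointly produce predictions. As a minimal illustrative sketch (the paper's actual KDSL mechanism is not specified here; the rule format, confidence threshold, and fallback strategy below are all assumptions), one simple instantiation routes a query to the rule model when a distilled rule fires with high confidence, and otherwise defers to the MLLM:

```python
# Hypothetical sketch of large/small model collaborative prediction.
# Rule format, threshold, and fallback logic are illustrative assumptions,
# not the paper's actual KDSL design.

def rule_model_predict(sample, rules):
    """Small model: interpretable keyword rules mapping to (intent, confidence)."""
    for keywords, intent, conf in rules:
        if all(k in sample["text"] for k in keywords):
            return intent, conf
    return None, 0.0

def collaborate(sample, rules, mllm_predict, threshold=0.8):
    """Trust a confident rule-based prediction; otherwise defer to the MLLM."""
    intent, conf = rule_model_predict(sample, rules)
    if intent is not None and conf >= threshold:
        return intent
    return mllm_predict(sample)

# Toy usage with distilled rules and a stub MLLM predictor.
rules = [
    (["refund"], "return_request", 0.9),
    (["where", "order"], "logistics_query", 0.85),
]
mllm_stub = lambda s: "general_inquiry"  # stand-in for the post-trained MLLM

print(collaborate({"text": "I want a refund"}, rules, mllm_stub))   # rule fires
print(collaborate({"text": "hello there"}, rules, mllm_stub))       # MLLM fallback
```

The design intent such a split captures is that rule updates never touch the MLLM's weights, so improving one task's rules cannot interfere with the other task's learned representations.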