Known Intents, New Combinations: Clause-Factorized Decoding for Compositional Multi-Intent Detection

📅 2026-03-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limited generalization of existing models to unseen intent combinations, a common challenge in real-world applications. To tackle this, the authors propose ClauseCompose, a method leveraging clause-level factorized decoding that trains lightweight decoders exclusively on single-intent data yet effectively recognizes novel intent compositions. The study also introduces CoMIX-Shift, the first benchmark specifically designed for compositional generalization in multi-intent understanding, featuring controlled data construction and zero-shot triplet evaluation protocols. Experimental results show that ClauseCompose achieves 95.7% exact-match accuracy on unseen intent pairs, substantially outperforming strong baselines such as full-sentence multilabel classification and fine-tuned BERT models.
๐Ÿ“ Abstract
Multi-intent detection papers usually ask whether a model can recover multiple intents from one utterance. We ask a harder and, for deployment, more useful question: can it recover new combinations of familiar intents? Existing benchmarks only weakly test this, because train and test often share the same broad co-occurrence patterns. We introduce CoMIX-Shift, a controlled benchmark built to stress compositional generalization in multi-intent detection through held-out intent pairs, discourse-pattern shift, longer and noisier wrappers, held-out clause templates, and zero-shot triples. We also present ClauseCompose, a lightweight decoder trained only on singleton intents, and compare it to whole-utterance baselines including a fine-tuned tiny BERT model. Across three random seeds, ClauseCompose reaches 95.7 exact match on unseen intent pairs, 93.9 on discourse-shifted pairs, 62.5 on longer/noisier pairs, 49.8 on held-out templates, and 91.1 on unseen triples. WholeMultiLabel reaches 81.4, 55.7, 18.8, 15.5, and 0.0; the BERT baseline reaches 91.5, 77.6, 48.9, 11.0, and 0.0. We also add a 240-example manually authored SNIPS-style compositional set with five held-out pairs; there, ClauseCompose reaches 97.5 exact match on unseen pairs and 86.7 under connector shift, compared with 41.3 and 10.4 for WholeMultiLabel. The results suggest that multi-intent detection needs more compositional evaluation, and that simple factorization goes surprisingly far once evaluation asks for it.
Problem

Research questions and friction points this paper is trying to address.

compositional generalization
multi-intent detection
unseen intent combinations
benchmarking
zero-shot triples
Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional generalization
multi-intent detection
clause-factorized decoding
zero-shot intent combination
controlled benchmark