Learning Together to Perform Better: Teaching Small-Scale LLMs to Collaborate via Preferential Rationale Tuning

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To sidestep the copyright and legal risks arising from the opaque pretraining data of closed-source large language models (LLMs), this paper improves the reasoning of small models (1B–8B) without distilling knowledge from larger LLMs, relying solely on their intrinsic capabilities. Methodologically, multiple instances of the same model are induced to exhibit distinct behaviors so as to generate diverse reasoning traces, and COLLATE, a trainable framework, uses preference optimization to tune the model to select the candidate rationale that maximizes the likelihood of the ground-truth answer, and it adapts across model families. The key contribution is that the improvement is realized entirely within the small model itself, eliminating dependence on external LLMs for knowledge. Evaluated on five benchmarks spanning mathematical problem solving, natural language inference, and commonsense reasoning, the approach outperforms prompting and trainable baselines, demonstrating strong generalizability and practical utility for small- and medium-scale models.

📝 Abstract
LLMs such as GPT-4 have shown a remarkable ability to solve complex questions by generating step-by-step rationales. Prior works have utilized this capability to improve smaller and cheaper LMs (say, with 7B parameters). However, various practical constraints, such as copyright and legal issues, owing to lack of transparency in the pre-training data of large (often closed) models, prevent their use in commercial settings. Little focus has been given to improving the innate reasoning ability of smaller models without distilling information from larger LLMs. To address this, we propose COLLATE, a trainable framework that tunes a (small) LLM to generate those outputs from a pool of diverse rationales that selectively improve the downstream task. COLLATE enforces multiple instances of the same LLM to exhibit distinct behavior and employs them to generate rationales to obtain diverse outputs. The LLM is then tuned via preference optimization to choose the candidate rationale which maximizes the likelihood of the ground-truth answer. COLLATE outperforms several trainable and prompting baselines on 5 datasets across 3 domains: maths problem solving, natural language inference, and commonsense reasoning. We show the efficacy of COLLATE on LLMs from different model families across varying parameter scales (1B to 8B) and demonstrate the benefit of multiple rationale providers guided by the end task through ablations. Code is released here (https://github.com/Sohanpatnaik106/collate).
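The selection step the abstract describes, i.e. scoring each candidate rationale by how likely it makes the ground-truth answer and forming a preference pair from the best and worst, can be sketched as follows. This is a minimal, self-contained illustration, not the paper's implementation: the `toy_loglik` scorer stands in for a real LM computing log p(answer | question, rationale), and all names are hypothetical.

```python
from typing import Callable, List, Tuple

def rank_rationales(
    rationales: List[str],
    answer_loglik: Callable[[str], float],
) -> Tuple[str, str]:
    """Return (chosen, rejected): the rationales that maximize and
    minimize the likelihood of the ground-truth answer. In COLLATE-style
    training these would form a preference pair for optimization."""
    scored = sorted(rationales, key=answer_loglik, reverse=True)
    return scored[0], scored[-1]

# Toy stand-in for log p(answer | question, rationale): pretend a
# rationale mentioning "add" makes the correct answer to "2 + 3" likelier.
def toy_loglik(rationale: str) -> float:
    return 0.0 if "add" in rationale else -5.0

chosen, rejected = rank_rationales(
    ["We add 2 and 3 to get 5.", "The answer is probably 6."],
    toy_loglik,
)
```

In practice the rationale pool would come from several instances of the same small LLM prompted or configured to behave differently, and the scorer would be the model's own conditional log-likelihood of the gold answer.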
Problem

Research questions and friction points this paper is trying to address.

Improving small LLMs' reasoning without large model distillation
Generating diverse rationales to enhance downstream task performance
Optimizing rationale selection to maximize ground-truth answer likelihood
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tune small LLMs to select optimal rationales
Enforce diverse behaviors in same LLM instances
Preference optimization for ground-truth alignment
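The preference-optimization step above is typically instantiated with a DPO-style objective over (chosen, rejected) rationale pairs. A minimal numeric sketch, assuming standard DPO with a frozen reference model; the log-probability values are illustrative, not from the paper:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss on one preference pair, given sequence-level
    log-probs under the policy (pi_*) and a frozen reference model
    (ref_*). Lower loss means the policy prefers the chosen rationale."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy already leans toward the chosen rationale, so the loss falls
# below the indifference value of log 2.
loss = dpo_loss(pi_chosen=-1.0, pi_rejected=-3.0,
                ref_chosen=-2.0, ref_rejected=-2.0)
```

Minimizing this loss pushes the small LLM toward producing rationales like the chosen one, so the end task (the ground-truth answer's likelihood) guides which reasoning style the model learns, without any signal from a larger teacher model.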