🤖 AI Summary
This work addresses the key challenge of achieving efficient, high-quality personalized generation by synergizing the personalization capabilities of local small models with the powerful reasoning capacity of cloud-based large models, all while preserving user privacy. The authors propose an asymmetric edge-cloud collaborative inference framework that, for the first time, reformulates speculative decoding as a distributed alignment protocol. By leveraging Bayesian knowledge fusion, the approach securely integrates private user context with cloud-side inference. A novel “draft–verify–recover” pipeline is introduced, incorporating ratio-based verification and intent-guided recovery mechanisms to enable logical validation and intent injection without exposing raw user data. Experiments demonstrate that the method significantly enhances generation quality while maintaining strict privacy guarantees, achieving a 2.36× speedup over baseline approaches.
📝 Abstract
Realizing personalized intelligence faces a core dilemma: sending user history to centralized large language models raises privacy concerns, while on-device small language models lack the reasoning capacity required for high-quality generation. Our pilot study shows that purely local enhancements remain insufficient to reliably bridge this gap. We therefore propose SpecSteer, an asymmetric collaborative inference framework that synergizes private on-device context with cloud-scale reasoning. SpecSteer casts collaboration as Bayesian knowledge fusion and repurposes speculative decoding as a distributed alignment protocol, yielding a Draft–Verify–Recover pipeline: the on-device model drafts personalized sequences; the cloud validates via a ratio-based mechanism that decouples reasoning verification from private context, filtering logical flaws without accessing raw user data; upon rejection, a steering-based recovery step injects local intent during correction. Experiments demonstrate that SpecSteer successfully closes the reasoning gap and achieves superior personalized generation performance, while delivering a 2.36× speedup over standard baselines.
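To make the Draft–Verify step concrete, below is a minimal sketch of the standard speculative-decoding acceptance test that ratio-based verification schemes build on: the cloud (target) model accepts a drafted token with probability min(1, p/q), where p and q are the target and draft probabilities, and resamples from the residual distribution on rejection. This is the generic textbook mechanism, not SpecSteer's exact protocol; the function and variable names are illustrative.

```python
import numpy as np

def speculative_verify(p, q, draft_token, rng):
    """One step of speculative-decoding verification.

    p: target-model next-token distribution (1-D array over the vocab)
    q: draft-model next-token distribution (same shape)
    draft_token: token index proposed by the draft model
    Returns (token, accepted): the drafted token if accepted, else a
    token resampled from the residual distribution max(p - q, 0).
    """
    # Ratio test: accept with probability min(1, p[t] / q[t]).
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # Rejected: resample so the overall output still follows p exactly.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False

# Toy usage over a 3-token vocabulary (illustrative numbers only).
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # target (cloud) distribution
q = np.array([0.2, 0.5, 0.3])   # draft (on-device) distribution
draft = int(rng.choice(3, p=q))
token, accepted = speculative_verify(p, q, draft, rng)
```

A key property of this scheme is losslessness: combining draft proposals with the ratio test and residual resampling yields tokens distributed exactly according to the target model, so verification never degrades output quality, only latency.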