LiveThinking: Enabling Real-Time Efficient Reasoning for AI-Powered Livestreaming via Reinforcement Learning

๐Ÿ“… 2025-10-08
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the real-time responsiveness requirements of digital avatars and the high inference latency of large reasoning models in AI-powered e-commerce livestreaming, this paper proposes LiveThinking, a two-stage lightweight inference framework. In Stage I, a 670B teacher model is distilled via Rejection Sampling Fine-Tuning (RFT) into a 30B Mixture-of-Experts (MoE) model with only 3B parameters activated per inference. In Stage II, reinforcement learning with Group Relative Policy Optimization (GRPO) and a multi-objective reward jointly optimizes correctness, helpfulness, and response length to compress the reasoning path. LiveThinking achieves sub-second latency and reduces computational cost by 30×. Deployed on Taobao Live, it improves answer correctness by 3.3% and helpfulness by 21.8%, and yields a statistically significant increase in GMV. The work combines knowledge distillation with multi-objective GRPO-based reasoning compression, offering a practical recipe for low-latency human-AI interaction.

๐Ÿ“ Abstract
In AI-powered e-commerce livestreaming, digital avatars require real-time responses to drive engagement, a task for which high-latency Large Reasoning Models (LRMs) are ill-suited. We introduce LiveThinking, a practical two-stage optimization framework to bridge this gap. First, we address computational cost by distilling a 670B teacher LRM into a lightweight 30B Mixture-of-Experts (MoE) model (3B active) using Rejection Sampling Fine-Tuning (RFT). This reduces deployment overhead but preserves the teacher's verbose reasoning, causing latency. To solve this, our second stage employs reinforcement learning with Group Relative Policy Optimization (GRPO) to compress the model's reasoning path, guided by a multi-objective reward function balancing correctness, helpfulness, and brevity. LiveThinking achieves a 30-fold reduction in computational cost, enabling sub-second latency. In real-world application on Taobao Live, it improved response correctness by 3.3% and helpfulness by 21.8%. Tested by hundreds of thousands of viewers, our system led to a statistically significant increase in Gross Merchandise Volume (GMV), demonstrating its effectiveness in enhancing user experience and commercial performance in live, interactive settings.
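The second stage described in the abstract pairs GRPO's group-relative advantage with a multi-objective reward over correctness, helpfulness, and brevity. A minimal sketch of that combination follows; the weights, score inputs, and token budget are illustrative assumptions, not values from the paper:

```python
# Illustrative multi-objective reward (correctness, helpfulness, brevity)
# combined with GRPO-style group-relative advantages. All weights and the
# target token budget are hypothetical placeholders.
from statistics import mean, pstdev

def combined_reward(correct: float, helpful: float, n_tokens: int,
                    target_tokens: int = 256,
                    w_c: float = 1.0, w_h: float = 0.5, w_b: float = 0.3) -> float:
    """Blend correctness and helpfulness scores with a brevity bonus."""
    brevity = max(0.0, 1.0 - n_tokens / target_tokens)  # shorter -> higher
    return w_c * correct + w_h * helpful + w_b * brevity

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each sampled response's reward against its own group,
    avoiding a separate learned value function."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Four sampled responses to the same prompt: (correct, helpful, n_tokens)
samples = [(1.0, 0.8, 120), (1.0, 0.6, 300), (0.0, 0.4, 90), (1.0, 0.9, 200)]
rewards = [combined_reward(c, h, n) for c, h, n in samples]
advs = group_relative_advantages(rewards)
```

Note how the incorrect response (third sample) receives the lowest group-relative advantage even though it is the shortest: the brevity term only breaks ties among otherwise good answers, which is what keeps length compression from degrading correctness.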
Problem

Research questions and friction points this paper is trying to address.

Enabling real-time reasoning for AI livestreaming avatars
Reducing computational cost of large reasoning models
Compressing reasoning paths while maintaining response quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distilled 670B teacher into 30B MoE model
Used reinforcement learning to compress reasoning path
Achieved sub-second latency with multi-objective optimization
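The first innovation above, distillation via Rejection Sampling Fine-Tuning, amounts to sampling several teacher responses per prompt and keeping only those that pass a verification check as fine-tuning targets. A minimal sketch, assuming a generic sampler and acceptance predicate (both placeholders, not the paper's components):

```python
# Hedged sketch of RFT data curation: draw up to k teacher samples per
# prompt, keep the first response that the verifier accepts, and drop
# prompts where nothing passes. Sampler/verifier are stand-ins.
from typing import Callable

def rft_dataset(prompts: list[str],
                sample: Callable[[str], str],
                accept: Callable[[str, str], bool],
                k: int = 4) -> list[tuple[str, str]]:
    """Return (prompt, response) pairs that pass the acceptance check."""
    kept = []
    for p in prompts:
        for _ in range(k):
            resp = sample(p)
            if accept(p, resp):
                kept.append((p, resp))
                break  # one verified response per prompt suffices here
    return kept

# Toy deterministic sampler: q1 succeeds on its second draw, q2 never does.
responses = {"q1": ["bad", "good"], "q2": ["bad", "bad", "bad", "bad"]}
counters = {"q1": 0, "q2": 0}
def sample(p: str) -> str:
    i = counters[p]
    counters[p] = min(i + 1, len(responses[p]) - 1)
    return responses[p][i]

data = rft_dataset(["q1", "q2"], sample, lambda p, r: r == "good")
# data keeps only q1's verified response; q2 is filtered out entirely
```

The student model is then fine-tuned only on the surviving pairs, so it imitates the teacher's verified reasoning rather than its mistakes.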
๐Ÿ”Ž Similar Papers
No similar papers found.
Authors

Yuhan Sun — Ph.D. student of Computer Science, Arizona State University; GeoSpatial Graph Database
Zhiwei Huang — Taobao & Tmall Group of Alibaba, Hangzhou, Zhejiang, China
Wanqing Cui — Taobao & Tmall Group of Alibaba, Beijing, China
Shaopan Xiong — ROLL Team of Alibaba, Hangzhou, Zhejiang, China
Yazhi Guo — Taobao & Tmall Group of Alibaba, Hangzhou, Zhejiang, China
Meiguang Jin — Taobao & Tmall Group of Alibaba, Beijing, China
Junfeng Ma — Taobao & Tmall Group of Alibaba, Hangzhou, Zhejiang, China