TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address critical limitations of vision-language models (VLMs) in embodied intelligence, including poor generalization in dynamic environments, low sample efficiency, inconsistent reasoning, and post-fine-tuning degradation, this paper proposes Thought-Centric Preference Optimization (TCPO), a preference-based framework centered on reasoning traces. TCPO is the first to apply preference learning to chain-of-thought (CoT) optimization: it constructs fine-grained pairwise samples over individual reasoning steps and introduces an Action Policy Consistency (APC) constraint to jointly optimize the reasoning process and the resulting action decisions. By unifying preference learning, reinforcement learning, and CoT modeling, TCPO achieves an average task success rate of 26.67% on ALFWorld, a 6-percentage-point improvement over RL4VLM, while also improving model stability, response latency, and cross-task generalization.

📝 Abstract
Leveraging the generalization capabilities of vision-language models (VLMs) for context-specific dynamic tasks in embodied artificial intelligence remains a significant challenge. Although supervised fine-tuned models align better with the real physical world, they still exhibit sluggish responses and hallucination issues in dynamically changing environments, necessitating further alignment. Existing post-SFT methods, which rely on reinforcement learning and chain-of-thought (CoT) approaches, are constrained by sparse rewards and action-only optimization, resulting in low sample efficiency, poor consistency, and model degradation. To address these issues, this paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. Specifically, TCPO introduces a stepwise preference-based optimization approach that transforms sparse reward signals into richer step-level sample pairs. It emphasizes alignment of the model's intermediate reasoning process, mitigating the problem of model degradation. Moreover, by incorporating an Action Policy Consistency (APC) constraint, it further imposes consistency constraints on the model's output. Experiments in the ALFWorld environment demonstrate an average success rate of 26.67%, a 6% improvement over RL4VLM, validating the effectiveness of the approach in mitigating model degradation after fine-tuning. These results highlight the potential of integrating preference-based learning techniques with CoT processes to enhance the decision-making capabilities of vision-language models in embodied agents.
Problem

Research questions and friction points this paper is trying to address.

Addressing sluggish responses and hallucinations in dynamic embodied AI environments
Overcoming sparse rewards and action-only optimization in post-SFT methods
Mitigating model degradation and improving consistency in decision-making processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stepwise preference optimization that converts sparse rewards into richer step-level sample pairs
Aligns intermediate reasoning to reduce model degradation
Action Policy Consistency Constraint for output consistency
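The stepwise preference objective and the APC constraint summarized above can be sketched in code. This is a minimal, hypothetical illustration rather than the paper's implementation: the function names, the DPO-style log-sigmoid form of the step-level objective, and the KL-divergence form of the consistency term are all assumptions chosen to make the two ideas concrete.

```python
import math

def stepwise_preference_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             beta=0.1):
    """DPO-style objective over a pair of reasoning traces.

    Each argument is a list of per-step log-probabilities, one entry per
    chain-of-thought step; the ref_* lists come from the frozen
    pre-optimization (reference) policy. Names and shapes here are
    illustrative assumptions, not the paper's interface.
    """
    # Sum the per-step log-ratio margins between chosen and rejected traces
    margin = beta * sum((c - rc) - (r - rr)
                        for c, r, rc, rr in zip(logp_chosen, logp_rejected,
                                                ref_logp_chosen,
                                                ref_logp_rejected))
    # Negative log-sigmoid of the preference margin: -log sigmoid(m)
    return math.log1p(math.exp(-margin))

def action_consistency_penalty(p_current, p_reference):
    """Illustrative stand-in for the APC constraint: KL divergence between
    the current and reference action distributions, penalizing the policy
    for drifting away from actions the reference model would take."""
    return sum(p * math.log(p / q)
               for p, q in zip(p_current, p_reference) if p > 0)
```

A full training loss would presumably combine the two terms, e.g. `stepwise_preference_loss(...) + lam * action_consistency_penalty(...)`, with the weight `lam` tuned per task; the paper's exact weighting is not specified in this summary.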