🤖 AI Summary
This work addresses the challenge of limited controllability in unsupervised text style transfer, particularly when fine-grained style intensity distinctions are required in the absence of parallel corpora and when adjacent intensity levels exhibit subtle differences. To tackle this, the authors propose a two-stage training paradigm: first performing supervised fine-tuning (SFT) on a large language model using synthetically generated parallel data, followed by reinforcement learning via proximal policy optimization (PPO) with a novel hierarchical intensity-aware reward function. This reward mechanism uniquely integrates both global and local style features, enabling precise, fine-grained control over style intensity. Experimental results demonstrate that the proposed approach significantly outperforms existing methods on two unsupervised text style transfer benchmarks, effectively generating stylistically distinct outputs even for closely adjacent intensity levels.
📝 Abstract
Unsupervised Text Style Transfer (UTST) aims to build a system that transfers the stylistic properties of a given text without parallel text pairs. Compared with text transfer between style polarities, UTST with controllable intensity is more challenging due to the subtle differences in stylistic features across intensity levels. Faced with the challenges posed by the lack of parallel data and the indistinguishability of adjacent intensity levels, we propose an SFT-then-PPO paradigm to fine-tune an LLM. We first fine-tune the LLM with synthesized parallel data. Then, we further train the LLM with PPO, where the rewards are carefully designed to distinguish stylistic intensity at hierarchical levels. Both global and local stylistic features are considered in formulating the reward functions. Experiments on two UTST benchmarks show that both rewards have their respective advantages, and that applying them to LLM fine-tuning effectively improves the performance of an LLM backbone across various evaluation metrics. Even for close intensity levels, we can still observe a noticeable stylistic difference in the generated text.
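To make the reward design concrete, here is a minimal illustrative sketch (not the paper's actual implementation): an intensity-aware reward that blends a global, sentence-level intensity score with averaged local, token-level scores, and peaks when the blended intensity matches the target level. The classifiers are stubbed with a tiny hypothetical lexicon purely for demonstration; in practice they would be learned style-intensity models.

```python
from typing import List

# Hypothetical lexicon mapping words to local intensity scores in [0, 1]
# (a stand-in for a learned token-level style-intensity model).
INTENSITY_LEXICON = {"good": 0.3, "great": 0.6, "fantastic": 0.9}

def global_intensity(tokens: List[str]) -> float:
    """Stand-in for a sentence-level (global) intensity classifier."""
    scores = [INTENSITY_LEXICON.get(t, 0.0) for t in tokens]
    return max(scores) if scores else 0.0

def local_intensity(tokens: List[str]) -> float:
    """Stand-in for averaged token-level (local) intensity features."""
    scores = [INTENSITY_LEXICON[t] for t in tokens if t in INTENSITY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def intensity_reward(tokens: List[str], target: float, alpha: float = 0.5) -> float:
    """Reward is highest when the blended global/local intensity matches
    the target level; alpha weights global vs. local features."""
    blended = alpha * global_intensity(tokens) + (1 - alpha) * local_intensity(tokens)
    return 1.0 - abs(blended - target)  # in [0, 1] for targets in [0, 1]

# A high-intensity target rewards the stronger phrasing more.
r_strong = intensity_reward(["a", "fantastic", "movie"], target=0.9)
r_mild = intensity_reward(["a", "good", "movie"], target=0.9)
assert r_strong > r_mild
```

In a PPO loop, such a reward would score each generated sentence against its requested intensity level, so that outputs for adjacent levels are pushed apart even when their surface differences are subtle.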