Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

📅 2025-10-13
🤖 AI Summary
Existing LLM alignment methods struggle to control the strength of user-defined attributes precisely. This paper proposes a goal-directed, fine-grained strength-control framework: it formalizes strength regulation as a sequential decision-making problem, trains a lightweight value function via temporal-difference learning to predict the final attribute strength from partial generations, and applies gradient-based interventions to hidden-layer representations, enabling continuous, differentiable navigation of the LLM's internal representation space. The method requires no architectural modification or full-parameter fine-tuning. Evaluated on LLaMA-3.2-3B and Phi-4-mini, it achieves high-precision strength control (mean absolute error < 0.08) and significantly improves three downstream tasks (preference data synthesis, Pareto-front optimization, and alignment distillation), overcoming the limitation of conventional alignment approaches, which provide only directional guidance.
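
The target-reaching formulation in the summary can be contrasted with conventional directional alignment in one line. The notation here (value model v, full generation x_{1:T}, target intensity y*) is our shorthand, not the paper's:

```latex
% Directional alignment: push the attribute score as high as possible
\max_{\theta}\; \mathbb{E}\left[\, v(x_{1:T}) \,\right]

% Target-reaching (this paper's framing): land on a specific intensity y^{*}
\min_{\theta}\; \mathbb{E}\left[\, \big( v(x_{1:T}) - y^{*} \big)^{2} \,\right]
```

The squared-error objective is what makes continuous, differentiable steering toward an exact intensity possible, rather than only "more" or "less" of an attribute.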

📝 Abstract
Precise attribute intensity control--generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities--is crucial for AI systems adaptable to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance, failing to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem, rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, thereby steering LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely towards specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method's ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available at https://github.com/Pre-Control/pre-control
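
The gradient-based intervention of design (3) can be sketched in a few lines. This is a toy illustration under our own assumptions, not the paper's implementation: we pretend the value function is a linear probe v(h) = w @ h over a hidden representation h, and take gradient steps on the squared distance to the target intensity; the actual method backpropagates through a learned value function inside the LLM.

```python
import numpy as np

def intervene(h, w, y_star, step_size=0.1):
    """One gradient step nudging h so that v(h) = w @ h approaches y_star."""
    v = w @ h                         # predicted attribute intensity
    grad = 2.0 * (v - y_star) * w     # d/dh of (v - y_star)^2
    return h - step_size * grad

w = np.full(8, 0.5)                   # toy linear probe weights (assumed)
h = np.linspace(-1.0, 1.0, 8)         # toy hidden state
target = 0.5                          # user-specified attribute intensity
for _ in range(50):
    h = intervene(h, w, target)

print(round(float(w @ h), 3))         # steered prediction, close to 0.5
```

Because the target is a specific value rather than "as high as possible", the same update rule can steer the representation down as well as up, which is the crux of target-reaching versus maximization.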
Problem

Research questions and friction points this paper is trying to address.

Achieving precise attribute intensity control in LLM outputs
Replacing directional guidance with target-reaching formulation
Enabling fine-grained continuous control via representation editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulating intensity control as target-reaching problem
Training value function via temporal-difference learning
Employing gradient-based interventions on hidden representations
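
The temporal-difference idea behind the value function can be illustrated with the classic random-walk toy problem: intermediate states learn to predict a "final score" that is only revealed at the end, just as a partial generation's value must predict the final attribute intensity. The setup below is our illustration, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 7                  # states 0..6; 0 and 6 are terminal
V = np.zeros(n_states)        # value estimate for each state
alpha = 0.05                  # TD learning rate

for _ in range(5000):
    s = 3                     # every episode starts in the middle
    while s not in (0, 6):
        s_next = s + rng.choice((-1, 1))
        # the final "score" arrives only at the right terminal state
        reward = 1.0 if s_next == 6 else 0.0
        # TD(0) target: terminal reward, or bootstrapped next-state value
        target = reward if s_next in (0, 6) else V[s_next]
        V[s] += alpha * (target - V[s])
        s = s_next

# True values for states 1..5 are 1/6, 2/6, ..., 5/6.
print(np.round(V[1:6], 2))
```

The key property is that states far from the terminal reward still acquire accurate value estimates by bootstrapping from their successors, which is what lets a lightweight value function score early prefixes of a generation.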