🤖 AI Summary
Existing steering methods for multimodal large language models (MLLMs) predominantly rely on static, input-agnostic steering vectors, which makes them ill-suited to safety and factuality requirements that vary across queries: the model should refuse an illegal request, for instance, but respond helpfully to a medical consultation. To address this, the authors propose input-dependent, fine-grained steering: a lightweight auxiliary module is trained to predict a linear offset vector in feature space conditioned on the input, enabling per-query behavioral modulation of the LLM at inference time. This work introduces, for the first time, dynamic, input-specific steering into MLLMs, departing from static paradigms such as mean steering. Training signals are derived via contrastive prompt generation, yielding an efficient, learnable mapping from inputs to steering vectors. Experiments demonstrate substantial reductions in hallucination rates and marked improvements in safety-response accuracy, consistently outperforming static steering baselines.
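The core mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hidden size, the two-layer MLP, and the names `predict_steering_vector` and `apply_steering` are all hypothetical stand-ins for the trained auxiliary module and the steered MLLM layer.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 16  # hypothetical hidden size of the steered layer

# Hypothetical auxiliary module: a small two-layer MLP mapping the
# input's hidden representation to an input-specific steering vector.
# In the paper this module is trained; here its weights are random.
W1 = rng.normal(scale=0.1, size=(HIDDEN, 32))
b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, HIDDEN))
b2 = np.zeros(HIDDEN)

def predict_steering_vector(h):
    """Predict a per-input linear offset from the hidden state h."""
    z = np.maximum(h @ W1 + b1, 0.0)  # ReLU
    return z @ W2 + b2

def apply_steering(h, alpha=1.0):
    """Add the predicted offset to the hidden state at inference time."""
    return h + alpha * predict_steering_vector(h)

h = rng.normal(size=HIDDEN)            # stand-in for a layer activation
h_steered = apply_steering(h, alpha=0.5)
print(h_steered.shape)  # (16,)
```

Because the offset is a function of `h`, different inputs receive different shifts, in contrast to mean steering, which would add the same fixed vector to every input.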
📝 Abstract
Steering has emerged as a practical approach for post-hoc guidance of LLMs toward a desired behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); moreover, existing steering techniques, such as mean steering, rely on a single steering vector applied independently of the input query. This paradigm is limited when the desired behavior depends on the example at hand. For instance, a safe answer may consist of abstaining when asked about an illegal activity, or of pointing to external resources or expert consultation when asked for medical advice. In this paper, we investigate fine-grained steering that applies an input-specific linear shift, computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time, so we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed L2S (Learn-to-Steer), reduces hallucinations and enforces safety in MLLMs, outperforming static baselines.
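The contrastive training signal described above can be illustrated with a toy regression. Everything here is an assumption for illustration: `encode` stands in for extracting a hidden state from the MLLM (with and without a behavior-inducing prompt appended), and the dimensions, learning rate, and linear predictor are arbitrary choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # hypothetical hidden dimension

def encode(prompt_strength):
    # Stand-in for the MLLM's hidden state for a query; prompt_strength
    # mimics the effect of appending a behavior-inducing prompt.
    base = np.linspace(-1.0, 1.0, D)
    return base + prompt_strength * np.ones(D)

# Contrastive prompting: the same query encoded without and with an
# appended behavior instruction (e.g. a safety directive).
h_plain = encode(0.0)     # query alone
h_behavior = encode(0.5)  # query + behavior-inducing prompt

# The input-specific steering target is the difference of the two states.
v_target = h_behavior - h_plain

# Hypothetical auxiliary predictor: a linear map trained by L2 regression
# so that, at test time, it recovers the shift without the extra prompt.
W = rng.normal(scale=0.01, size=(D, D))

def train_step(h, target, lr=0.1):
    global W
    pred = h @ W
    grad = np.outer(h, pred - target) * 2.0 / D  # gradient of the MSE
    W -= lr * grad
    return np.mean((pred - target) ** 2)

losses = [train_step(h_plain, v_target) for _ in range(200)]
```

The regression loss shrinks toward zero, i.e. the predictor learns to map the plain hidden state to the contrastive shift, which is exactly what makes the steering vector available when the behavior prompt is absent at test time.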