🤖 AI Summary
This work addresses bias and safety risks that large language models (LLMs) pick up from seemingly benign training data, risks that are hard to detect before the model is trained. The paper introduces Data2Behavior, a new task: predicting, before any training is run, the behaviors a model would acquire from data that is not explicitly harmful. The authors propose a lightweight approach, Manipulating Data Features (MDF), which injects statistical features of the candidate data (such as mean representations) into the forward pass of a base model, with no fine-tuning or parameter updates, so that latent risk signals surface. In experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it, MDF reliably predicts unintended behaviors while using only about 20% of the GPU resources required for fine-tuning.
📝 Abstract
Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
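The idea of summarizing candidate data by a mean representation and injecting it into a forward pass can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the additive injection, the scaling coefficient `alpha`, the single linear readout standing in for a base model, and all function names are assumptions made for illustration. The abstract does not say at which layer or in what form MDF injects the features.

```python
import numpy as np

def mean_representation(hidden_states):
    """Average per-example hidden vectors (n_examples, d) into one (d,) summary."""
    return np.mean(np.asarray(hidden_states, dtype=float), axis=0)

def inject(hidden, data_feature, alpha=1.0):
    """Add the scaled data feature to every position of `hidden` (seq_len, d).

    Additive injection with a scalar `alpha` is an assumption of this sketch.
    """
    return hidden + alpha * data_feature

def toy_forward(hidden, w_out, data_feature=None, alpha=1.0):
    """Toy stand-in for a base model: optional injection, then a linear
    readout of the last position followed by a softmax. `w_out` is never updated."""
    if data_feature is not None:
        hidden = inject(hidden, data_feature, alpha)
    logits = hidden[-1] @ w_out
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
d, vocab = 8, 4
w_out = rng.normal(size=(d, vocab))                  # frozen "model" weights
candidate_hidden = rng.normal(loc=0.5, size=(32, d)) # hypothetical candidate-data states
prompt_hidden = rng.normal(size=(5, d))              # hypothetical probe-prompt states

feature = mean_representation(candidate_hidden)
base_probs = toy_forward(prompt_hidden, w_out)
shifted_probs = toy_forward(prompt_hidden, w_out, feature, alpha=2.0)
shift = shifted_probs - base_probs  # how the data feature moves the output distribution
```

In this toy setting, `shift` plays the role of the predicted behavioral change: it is computed from forward passes alone, and the weights in `w_out` stay untouched throughout.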