🤖 AI Summary
This work addresses the lack of behavioral traceability in large language models (LLMs) trained across multiple stages, a gap that hinders debugging and allows failures to recur. To tackle this challenge, the authors propose DebugLM, a framework that endows LLMs with built-in data provenance. By assigning unique identifiers to training samples, DebugLM enables the model to associate its inference-time outputs with specific data sources and supports precise, tag-based refusal mechanisms. Crucially, this approach allows targeted correction of undesirable behaviors at test time without retraining the model. Experiments demonstrate that DebugLM accurately traces the origins of model behaviors across multi-stage training pipelines, mitigates problematic outputs while preserving general performance, and substantially improves the observability and controllability of LLMs.
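The core idea — attaching a unique provenance identifier to every training sample so the model can later link its outputs back to a data source — can be sketched minimally. Note this is an illustrative assumption about the data format: the function `tag_samples`, the `provenance_tag` field, and the `<source:index>` tag scheme are hypothetical, not DebugLM's actual implementation.

```python
def tag_samples(samples, source_name):
    """Attach a unique provenance tag to each training sample.

    During training, the model would learn to emit this tag alongside
    its response, linking inference-time behavior to a data source.
    """
    tagged = []
    for i, sample in enumerate(samples):
        tagged.append({
            "text": sample["text"],
            # Hypothetical tag format: <dataset_name:sample_index>
            "provenance_tag": f"<{source_name}:{i}>",
        })
    return tagged

# Samples from different pipeline stages get distinct source names,
# so behaviors can later be traced to a specific stage and dataset.
pretrain = tag_samples([{"text": "example web document"}], "web_corpus")
sft = tag_samples([{"text": "example instruction pair"}], "sft_round1")
print(pretrain[0]["provenance_tag"])  # <web_corpus:0>
print(sft[0]["provenance_tag"])       # <sft_round1:0>
```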
📝 Abstract
Large language models (LLMs) are trained through multi-stage pipelines over heterogeneous data sources, yet developers lack a principled way to pinpoint the specific data responsible for an observed behavior. This lack of observability reduces debugging to reactive patching and makes failures prone to recur under distribution shift or subsequent model updates. To address this limitation, we propose DebugLM, a framework that equips LLMs with built-in data provenance, enabling them to explicitly trace the origins of their behaviors to specific training data sources. Specifically, the model learns to associate its responses with unique provenance tags that indicate the responsible dataset, empowering developers to precisely identify where undesirable behaviors were learned. Building on this capability, DebugLM further supports targeted test-time remediation, enabling developers to selectively trigger refusal of behaviors learned from specified data sources without retraining or modifying model parameters. Experiments demonstrate that DebugLM provides accurate behavior tracing in multi-stage training pipelines and effective test-time remediation while preserving the general utility of the model.
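The test-time remediation step described above can be sketched as a tag-based filter: once a data source is identified as the origin of an undesirable behavior, responses carrying that source's tag are replaced with a refusal, with no parameter updates. Everything here is an assumed illustration — the tag format, `source_of`, `remediate`, and the refusal string are hypothetical stand-ins, not the paper's interface.

```python
import re

# Hypothetical blocklist: sources a developer has traced to bad behavior.
BLOCKED_SOURCES = {"sft_round1"}
REFUSAL = "I can't help with that."

def source_of(tag):
    """Extract the dataset name from a tag like '<sft_round1:42>'."""
    m = re.match(r"<([^:>]+):\d+>", tag)
    return m.group(1) if m else None

def remediate(response, tag):
    """Refuse a response whose predicted provenance tag is blocklisted;
    otherwise pass it through unchanged."""
    if source_of(tag) in BLOCKED_SOURCES:
        return REFUSAL
    return response

print(remediate("Some answer", "<sft_round1:42>"))  # I can't help with that.
print(remediate("Some answer", "<web_corpus:7>"))   # Some answer
```

Because the filter keys only on the tag the model itself emits, remediation is selective: behaviors learned from other sources pass through untouched, which is what lets general utility be preserved.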