🤖 AI Summary
Existing approaches to enhancing the trustworthiness of large language models (LLMs)—including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), prompt engineering, and conventional representation engineering—suffer from high computational cost, poor robustness, or reliance on manual sample curation and static steering strategies. Method: We propose MASteer, the first end-to-end automated representation engineering framework for trustworthiness repair, featuring (1) AutoTester, a multi-agent system that autonomously generates diverse, high-quality steer samples, and (2) AutoRepairer, a dynamic anchor-vector mechanism enabling context-aware, adaptive steering at inference time. Contribution/Results: The method requires no model fine-tuning, is lightweight and computationally efficient, and improves LLM trustworthiness by 15.36% on LLaMA-3.1-8B-Chat and 4.21% on Qwen-3-8B-Chat—significantly outperforming all baselines—while preserving generalization capability, robustness, and intrinsic model functionality.
📝 Abstract
Large Language Models (LLMs) face persistent and evolving trustworthiness issues, motivating developers to seek automated and flexible repair methods that enable convenient deployment across diverse scenarios. Existing repair methods like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) are costly and slow, while prompt engineering lacks robustness and scalability. Representation engineering, which steers model behavior by injecting targeted concept vectors during inference, offers a lightweight, training-free alternative. However, current approaches depend on manually crafted samples and fixed steering strategies, limiting automation and adaptability. To overcome these challenges, we propose MASteer, the first end-to-end framework for trustworthiness repair in LLMs based on representation engineering. MASteer integrates two core components: AutoTester, a multi-agent system that generates diverse, high-quality steer samples tailored to developer needs; and AutoRepairer, which constructs adaptive steering strategies with anchor vectors for automated, context-aware strategy selection during inference. Experiments on standard and customized trustworthiness tasks show MASteer consistently outperforms baselines, improving metrics by 15.36% on LLaMA-3.1-8B-Chat and 4.21% on Qwen-3-8B-Chat, while maintaining general model capabilities. MASteer demonstrates strong robustness, generalization, and practical value for scalable, efficient trustworthiness repair.
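To make the core mechanism concrete, the sketch below illustrates the general idea behind anchor-guided activation steering as the abstract describes it: at inference time, the current hidden state is compared against a set of anchor vectors, and the steering vector paired with the best-matching anchor is added to the activation. This is a minimal toy in NumPy, not MASteer's actual implementation; the function names, the cosine-similarity selection rule, and the scaling factor `alpha` are all illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_and_steer(hidden, anchors, steer_vectors, alpha=4.0):
    """Illustrative anchor-oriented steering (names/rule are assumptions):
    pick the steering vector whose anchor best matches the current hidden
    state by cosine similarity, then inject it additively."""
    sims = [cosine(hidden, a) for a in anchors]
    best = int(np.argmax(sims))          # context-aware strategy selection
    steered = hidden + alpha * steer_vectors[best]  # vector injection
    return steered, best

# Toy demo: three (anchor, steering-vector) pairs in an 8-dim space.
rng = np.random.default_rng(0)
d = 8
anchors = [rng.normal(size=d) for _ in range(3)]
steer_vectors = [rng.normal(size=d) for _ in range(3)]

# A hidden state close to anchor 1 should trigger steering vector 1.
hidden = anchors[1] + 0.1 * rng.normal(size=d)
steered, chosen = select_and_steer(hidden, anchors, steer_vectors)
```

In a real LLM this addition would happen inside a forward hook on a chosen transformer layer, so the base model's weights stay untouched, which is what makes the approach training-free.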