Large language models require a new form of oversight: capability-based monitoring

📅 2025-11-05
🤖 AI Summary
Traditional task-oriented monitoring approaches struggle to ensure the reliability of general-purpose large language models (LLMs) in high-stakes domains such as healthcare. The authors propose capability-based monitoring, which shifts focus from task-specific performance to the shared capabilities underlying many tasks, e.g., reasoning, summarization, translation, and safety guardrails. Unlike existing approaches built around fixed tasks and assumed dataset drift, this framework organizes evaluation around capabilities reused across diverse applications, enabling scalable, cross-task assessment. Grounding monitoring at the capability level supports systematic identification of systemic flaws, detection of long-tail errors, and tracking of emergent behaviors that task-based monitoring may miss. The paper outlines implementation considerations for developers, organizational leaders, and professional societies, positioning capability-based monitoring as a foundation for safety governance of generalist AI systems in healthcare.

📝 Abstract
The rapid adoption of large language models (LLMs) in healthcare has been accompanied by scrutiny of their oversight. Existing monitoring approaches, inherited from traditional machine learning (ML), are task-based and founded on assumed performance degradation arising from dataset drift. In contrast, with LLMs, inevitable model degradation due to changes in populations compared to the training dataset cannot be assumed, because LLMs were not trained for any specific task in any given population. We therefore propose a new organizing principle guiding generalist LLM monitoring that is scalable and grounded in how these models are developed and used in practice: capability-based monitoring. Capability-based monitoring is motivated by the fact that LLMs are generalist systems whose overlapping internal capabilities are reused across numerous downstream tasks. Instead of evaluating each downstream task independently, this approach organizes monitoring around shared model capabilities, such as summarization, reasoning, translation, or safety guardrails, in order to enable cross-task detection of systemic weaknesses, long-tail errors, and emergent behaviors that task-based monitoring may miss. We describe considerations for developers, organizational leaders, and professional societies for implementing a capability-based monitoring approach. Ultimately, capability-based monitoring will provide a scalable foundation for safe, adaptive, and collaborative monitoring of LLMs and future generalist artificial intelligence models in healthcare.
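To make the organizing principle concrete, the sketch below shows one way a capability-based monitor could be structured. All names here are illustrative assumptions, not an implementation described in the paper: tasks are tagged with the shared capabilities they exercise, and every task-level evaluation event updates each of those capabilities, so evidence accumulates across tasks rather than within them.

```python
# Minimal sketch of capability-based monitoring; capability and task names
# are hypothetical examples, not taken from the paper.
from collections import defaultdict
from dataclasses import dataclass, field

# Each downstream task is tagged with the shared capabilities it reuses.
TASK_CAPABILITIES: dict[str, set[str]] = {
    "discharge_summary":     {"summarization", "safety_guardrails"},
    "patient_message_draft": {"summarization", "translation", "safety_guardrails"},
    "differential_support":  {"reasoning", "safety_guardrails"},
}

@dataclass
class CapabilityLedger:
    """Aggregates pass/fail evaluation events by capability, across tasks."""
    passes: defaultdict = field(default_factory=lambda: defaultdict(int))
    failures: defaultdict = field(default_factory=lambda: defaultdict(int))

    def record(self, task: str, passed: bool) -> None:
        # One task-level event updates every capability the task exercises;
        # this cross-task pooling is the core of the approach.
        for cap in TASK_CAPABILITIES[task]:
            (self.passes if passed else self.failures)[cap] += 1

    def failure_rate(self, cap: str) -> float:
        total = self.passes[cap] + self.failures[cap]
        return self.failures[cap] / total if total else 0.0

ledger = CapabilityLedger()
ledger.record("discharge_summary", passed=True)
ledger.record("patient_message_draft", passed=False)
print(f"summarization failure rate: {ledger.failure_rate('summarization'):.2f}")
```

Run as-is, the toy records one pass and one failure for the two tasks that share summarization, so that capability's failure rate is 0.50 even though each task contributed only a single event.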
Problem

Research questions and friction points this paper is trying to address.

Existing task-based monitoring fails for generalist LLMs in healthcare
Unlike traditional ML, LLM degradation cannot be assumed from population shift alone
Monitoring must scale to detect cross-task weaknesses, long-tail errors, and emergent behaviors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Capability-based monitoring replaces task-based oversight
Monitoring organizes around shared model capabilities such as reasoning and summarization
Enables cross-task detection of systemic weaknesses and long-tail errors (see the pooled-signal sketch below)
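Why pooling by capability gains sensitivity to long-tail errors, as referenced in the list above, can be seen in a toy calculation; the counts and the 1% baseline below are hypothetical. Each task's failure count alone sits within sampling noise, but summed across the shared capability the excess is unambiguous.

```python
# Toy long-tail example with hypothetical counts: no single task crosses a
# z = 1.96 alert threshold, but the capability-level pool does.
from math import sqrt

BASELINE = 0.01  # assumed acceptable failure rate for the capability

# (failures, samples) for three tasks that all reuse summarization
tasks = {
    "discharge_summary":    (4, 200),
    "radiology_impression": (5, 250),
    "referral_letter":      (6, 300),
}

def z_score(failures: int, n: int, p: float = BASELINE) -> float:
    """Normal approximation to a one-sided binomial test against rate p."""
    return (failures - n * p) / sqrt(n * p * (1 - p))

for name, (f, n) in tasks.items():
    print(f"{name:>22}: z = {z_score(f, n):.2f}")  # 1.42, 1.59, 1.74: none flagged

pooled_f = sum(f for f, _ in tasks.values())   # 15 failures
pooled_n = sum(n for _, n in tasks.values())   # 750 samples
print(f"{'pooled capability':>22}: z = {z_score(pooled_f, pooled_n):.2f}")  # 2.75: flagged
```

The same logic is what lets a capability-level monitor surface systemic weaknesses that each task owner, looking only at a local dashboard, would reasonably dismiss as noise.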
Katherine C. Kellogg
MIT Sloan School of Management, Boston, MA, USA
Bingyang Ye
AI in Medicine Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA; Department of Radiation Oncology, Brigham and Women’s Hospital/Dana-Farber Cancer Institute, Boston, MA, USA; Department of Computer Science, Brandeis University, Waltham, MA, USA
Yifan Hu
Harvard John A. Paulson School of Engineering and Applied Sciences, Cambridge, MA, USA
G.K. Savova
Computational Health Informatics Program, Boston Children’s Hospital, Harvard Medical School, Boston, MA, USA
Byron Wallace
Associate Professor, Northeastern University
natural language processing, machine learning, machine learning for health, model interpretability
Danielle S. Bitterman
Harvard Medical School
Oncology, Natural Language Processing, Artificial Intelligence