🤖 AI Summary
Accessing a model's internal states during large-model inference is hindered by high latency and limited flexibility. This work introduces DMI-Lib, which treats internal observability as a system-level primitive and decouples observation from the inference hot path. It does so with Ring², a GPU-CPU memory abstraction that captures and stages tensors asynchronously, and a policy-driven host backend that exports them. The design works across diverse inference backends, supports flexible placement of observation points, preserves serving optimizations, and stays within tight GPU memory budgets. Experiments show 0.4%–6.8% overhead in offline batch inference and an average of 6% overhead in moderate online serving, a 2× to 15× reduction in latency overhead compared with existing baselines that offer similar observability features.
📝 Abstract
Today's inference-time workloads increasingly depend on timely access to a model's internal states. We present DMI-Lib, a high-speed deep model inspector that treats internal observability as a first-class systems primitive. It decouples observation from the inference hot path via an asynchronous observability substrate built from Ring², a GPU-CPU memory abstraction for capturing and staging tensors, and a policy-controlled host backend that exports them. DMI-Lib enables the placement of observation points across a rich space of internal signals and diverse inference backends while preserving serving optimizations and adhering to tight GPU memory budgets. Our experiments demonstrate that DMI-Lib incurs only 0.4%–6.8% overhead in offline batch inference and an average of 6% in moderate online serving, reducing latency overhead by 2×–15× compared to existing baselines with similar observability features. DMI-Lib is open-sourced at https://github.com/ProjectDMX/DMI.
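
The decoupling idea in the abstract can be illustrated with a minimal CPU-only sketch. It is not DMI-Lib's API or the Ring² implementation: the names `StagingRing`, `run_exporter`, and the drop-oldest policy are hypothetical, and plain Python lists stand in for tensors. It only shows the general pattern of a bounded staging buffer whose producer never waits for the consumer, drained by a background exporter.

```python
import threading
from collections import deque


class StagingRing:
    """Hypothetical bounded staging buffer (illustrative, not DMI-Lib's Ring^2).

    The producer never waits for the consumer: when the buffer is full,
    the oldest entry is evicted and counted as dropped.
    """

    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)
        self._cond = threading.Condition()
        self._closed = False
        self.dropped = 0

    def push(self, tag, payload):
        with self._cond:
            if len(self._buf) == self._buf.maxlen:
                self.dropped += 1  # deque(maxlen=...) evicts the oldest entry
            self._buf.append((tag, payload))
            self._cond.notify()

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def pop(self):
        """Return the next staged item, or None once closed and drained."""
        with self._cond:
            while not self._buf and not self._closed:
                self._cond.wait()
            return self._buf.popleft() if self._buf else None


def run_exporter(ring, sink):
    # Background consumer: drains staged items off the "hot path".
    while (item := ring.pop()) is not None:
        sink.append(item)


ring = StagingRing(capacity=4)
exported = []
worker = threading.Thread(target=run_exporter, args=(ring, exported))
worker.start()

# Stand-in for the inference loop: capture a value per step and move on.
for step in range(3):
    ring.push(f"layer0.step{step}", [step, step + 1])

ring.close()
worker.join()
```

In this sketch the capacity bound plays the role of a memory budget, and the eviction rule is one possible export policy; a real system could instead block, sample, or spill to host memory.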