🤖 AI Summary
This study addresses the instability and high failure rates of large language models (LLMs) in MAPDL-based finite element simulation, which stem from a lack of structured execution control, tool encapsulation, and fault recovery. To overcome these limitations, the authors propose CAX-Agent, a lightweight agent middleware with a three-tier architecture (an LLM service layer, an agent middleware layer, and a solver backend) that orchestrates simulation tasks. Its recovery-ladder fault-tolerance mechanism escalates progressively from rule-based repair through model-driven regeneration to context enrichment and, if necessary, human intervention. Evaluated on 50 structural benchmark cases with three repeated runs per strategy, the model_only strategy achieves a 92.67% task completion rate, an average task score of 3.59 out of 4, and an 84% zero-intervention rate, outperforming the rule_only and no_recovery baselines with large effect sizes (Cliff’s delta = 0.81–0.87). The benchmark uses deliberately simple geometries, so these results characterize the recovery policy rather than general simulation capability.
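The escalation order described above can be sketched in a few lines. This is a minimal illustration only: the source does not publish CAX-Agent's API, so every name here (`run_with_ladder`, `Outcome`, `toy_execute`, the rung functions) is hypothetical, and the solver and model calls are replaced by toy stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical types: the source does not publish CAX-Agent's actual API.
Execute = Callable[[str], Optional[str]]   # returns an error message, or None on success
Repair = Callable[[str, str], str]         # (script, error) -> revised script


@dataclass
class Outcome:
    script: str
    error: Optional[str]
    trace: List[str]   # rungs visited, in order


def run_with_ladder(script: str, execute: Execute,
                    rungs: List[Tuple[str, Repair]]) -> Outcome:
    """Run the script; on failure, apply each rung in order until one succeeds."""
    trace = ["initial"]
    error = execute(script)
    for name, repair in rungs:
        if error is None:
            break                      # success: stop climbing
        script = repair(script, error)
        error = execute(script)
        trace.append(name)
    return Outcome(script, error, trace)


# Toy stand-ins for illustration only (not a real solver or LLM call).
def toy_execute(script: str) -> Optional[str]:
    return None if "SOLVE" in script else "no SOLVE command found"


def rule_patch(script: str, error: str) -> str:
    # Whitespace cleanup only; it cannot fix the toy failure above.
    return script.strip() + "\n"


def regenerate(script: str, error: str) -> str:
    # Pretend the model rewrote the script with a solution phase.
    return script + "/SOLU\nSOLVE\n"


LADDER = [
    ("rule_patch", rule_patch),
    ("model_regeneration", regenerate),
    ("context_enrichment", regenerate),        # placeholder: would retry with added context
    ("human_intervention", lambda s, e: s),    # placeholder: would hand off to a person
]

outcome = run_with_ladder("/PREP7\n", toy_execute, LADDER)
print(outcome.trace)
```

In this toy run the rule patch fails to repair the script and the second rung succeeds, so the trace stops at `model_regeneration` and the later, more expensive rungs are never invoked. Ordering the rungs from cheapest to most costly is the design choice the ladder metaphor implies.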
📝 Abstract
Large language models deployed for MAPDL finite-element simulation face practical reliability challenges: without structured execution control, tool encapsulation, and fault recovery, outputs may be inconsistent and task failures are common. The Agent Harness paradigm addresses this by inserting domain-specific orchestration middleware that manages tool lifecycles, workflow state, and recovery escalation. This paper presents the architecture of CAX-Agent, a lightweight agent harness purpose-built for MAPDL automation, and empirically evaluates one of its core components -- the recovery policy. CAX-Agent organizes execution into three layers -- LLM service, agent harness, and solver backend -- with a recovery ladder that escalates from deterministic rule patching through model-driven regeneration to context enrichment and human intervention. We evaluate three recovery strategies (no_recovery, rule_only, and model_only) on 50 standard structural benchmarks with three repeated runs per strategy (450 case-runs total). Two independent human raters score task completion under blind conditions; inter-rater agreement is strong (quadratic weighted Cohen's kappa = 0.84, 96 percent of score pairs within one point). Model_only achieves the best completion rate (0.9267), task score (3.59/4), total score (9.16/10), and zero-intervention rate (0.84), outperforming rule_only (0.7733, 3.17/4, 7.03/10, 0.00) and no_recovery (0.6933, 2.74/4, 5.60/10, 0.00) with large effect sizes (Cliff's delta = 0.81-0.87). The benchmark uses deliberately simple geometries to isolate recovery-policy effects; we discuss the scope of these findings and directions for broader validation.
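For readers unfamiliar with the effect size reported above, Cliff's delta compares every score in one group with every score in the other: it is the fraction of pairs where the first group is higher minus the fraction where it is lower, so it ranges from -1 to 1. The sketch below uses the standard definition; the function name and the sample scores are illustrative assumptions and are not data from the paper.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys))."""
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Made-up task scores on a 0-4 scale, for illustration only.
strategy_a = [4, 4, 3, 4]
strategy_b = [2, 3, 3, 1]
print(cliffs_delta(strategy_a, strategy_b))
```

Here 14 of the 16 pairs favor the first group and the remaining 2 are ties, giving a delta of 0.875. Because the statistic depends only on orderings, it suits ordinal rater scores such as the 0-4 task score better than a mean-difference measure would.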