🤖 AI Summary
This work addresses format drift, evidence hallucination, and unreliable decision-making in structured prediction, challenges that arise from label skew, heterogeneous group difficulty, and semantic ambiguity, by proposing a two-stage robust framework. First, a task-agnostic structured prompting strategy integrates XML instructions, disambiguation rules, and self-verification mechanisms. Second, the STaR-DRO optimization method combines Tsallis mirror descent with state-aware group-loss reweighting, dynamically upweighting only persistently hard groups to avoid the instability of conventional exponentiated-gradient reweighting. The approach unifies state-aware Tsallis reweighting with structured prompt engineering, achieving an average zero-shot F1 improvement of 15.44 points on the EPPC Miner benchmark; with Llama-3.3-70B-Instruct, it attains Code and Sub-code F1 scores of 81.47 and 69.30, respectively, and reduces validation cross-entropy by up to 29.6% on the most challenging clinical categories.
📝 Abstract
Structured prediction requires models to generate ontology-constrained labels, grounded evidence, and valid structure under ambiguity, label skew, and heterogeneous group difficulty. We present a two-part framework for controllable inference and robust fine-tuning. First, we introduce a task-agnostic prompting strategy that combines XML-based instruction structure, disambiguation rules, verification-style reasoning, schema constraints, and self-validation to address format drift, label ambiguity, evidence hallucination, and metadata-conditioned confusion in in-context structured generation. Second, we introduce STaR-DRO, a stateful robust optimization method for group heterogeneity. It combines Tsallis mirror descent with momentum-smoothed, centered group-loss signals and bounded excess-only multipliers so that only persistently hard groups above a neutral baseline are upweighted, concentrating learning where it is most needed while avoiding both volatile, dense exponentiated-gradient reweighting and the unnecessary performance loss that comes from downweighting easier groups. We evaluate the combined framework on EPPC Miner, a benchmark for extracting hierarchical labels and evidence spans from patient-provider secure messages. Prompt engineering improves zero-shot performance by an average of +15.44 F1 across Code, Sub-code, and Span over four Llama models. Building on supervised fine-tuning, STaR-DRO further improves the hardest semantic decisions: on Llama-3.3-70B-Instruct, Code F1 rises from 79.24 to 81.47 and Sub-code F1 from 67.78 to 69.30, while preserving Span performance and reducing group-wise validation cross-entropy by up to 29.6% on the most difficult clinical categories. Because these rare and difficult groups correspond to clinically consequential communication behaviors, the gains are not merely statistical: they directly strengthen communication-mining reliability for patient-centered care analysis.
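To make the prompting strategy concrete, the sketch below illustrates the general pattern of an XML-structured prompt with disambiguation rules, a schema constraint, and a self-verification step. The tag names, rule text, and schema are invented for this illustration and are not the paper's actual prompt.

```python
# Hypothetical illustration of XML-structured prompting with disambiguation
# rules, a schema constraint, and self-verification. All tag names and rule
# wording here are invented, not the authors' exact prompt.
PROMPT_TEMPLATE = """\
<task>
  <instructions>Label the message with one ontology Code and Sub-code,
  and quote an exact evidence span from the message.</instructions>
  <disambiguation>
    <rule>If two codes apply, prefer the more specific Sub-code.</rule>
    <rule>Copy evidence spans verbatim; never paraphrase.</rule>
  </disambiguation>
  <output_schema>{{"code": str, "sub_code": str, "span": str}}</output_schema>
  <self_verification>Before answering, check that the span appears
  verbatim in the message and the Sub-code belongs to the Code.</self_verification>
  <message>{message}</message>
</task>
"""

def build_prompt(message: str) -> str:
    # Double braces in the template escape the JSON schema from str.format.
    return PROMPT_TEMPLATE.format(message=message)
```

The XML scaffolding separates instructions, rules, schema, and input so each concern can be edited or ablated independently, which is the property the abstract attributes to the strategy.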
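The reweighting idea behind STaR-DRO can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact update: it keeps an exponential moving average of per-group losses, computes an excess-only signal relative to the mean as a neutral baseline, and maps it through a Tsallis q-exponential (polynomial for q < 1, so less volatile than the standard exponential) with a hard bound on the multipliers. The class name, hyperparameters, and the choice of the mean as baseline are assumptions of this sketch.

```python
import numpy as np

# Illustrative sketch of state-aware, excess-only group reweighting in the
# spirit of STaR-DRO. Names, defaults, and the exact update rule are
# assumptions, not the paper's specification.
class StarDroReweighter:
    def __init__(self, n_groups, beta=0.9, eta=1.0, q=0.5, max_mult=3.0):
        self.m = np.zeros(n_groups)  # momentum-smoothed per-group losses
        self.beta = beta             # EMA smoothing factor
        self.eta = eta               # step size of the mirror-descent-style update
        self.q = q                   # Tsallis index; q < 1 gives polynomial growth
        self.max_mult = max_mult     # bound on per-group multipliers

    def update(self, group_losses):
        # Momentum-smoothed loss signal: transient spikes are damped, so only
        # persistently hard groups accumulate a large signal.
        self.m = self.beta * self.m + (1 - self.beta) * np.asarray(group_losses)
        # Centered, excess-only signal: groups at or below the neutral
        # baseline (here, the mean) receive no extra weight.
        excess = np.maximum(self.m - self.m.mean(), 0.0)
        # Tsallis q-exponential, exp_q(x) = (1 + (1-q)x)^(1/(1-q)); for q < 1
        # this grows polynomially, avoiding the volatility of exp(x).
        mult = np.power(1.0 + (1.0 - self.q) * self.eta * excess,
                        1.0 / (1.0 - self.q))
        mult = np.minimum(mult, self.max_mult)  # bounded multipliers
        return mult / mult.sum()                # normalized group weights
```

In use, `update` would be called once per step with the current per-group training losses, and the returned weights would scale each group's contribution to the batch loss; easy groups keep a uniform baseline weight rather than being actively downweighted.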