🤖 AI Summary
Multi-source vehicular data for eco-driving suffers from semantic ambiguity, missing annotations, and heterogeneous quality, all of which bias downstream models. To address this, the paper proposes a standardized data engineering framework with embedded domain knowledge. It establishes a hierarchical data maturity assessment system and designs a reusable four-stage pipeline (understanding, cleaning, augmentation, and alignment) that integrates rule-based cleaning, lightweight active learning for annotation, spatiotemporal consistency verification, and data augmentation guided by a driving-behavior graph. The key contribution is the first explicit standardization of the data engineering process and its tight coupling with traffic-domain constraints. Evaluated on real-world fleet data, the framework improves the prediction accuracy of downstream energy-saving strategy models by 12.7% and reduces data preparation time by 64%. The pipeline has been reused across three traffic AI projects.
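
As a rough illustration of how a four-stage pipeline of this kind might be composed, the Python sketch below chains four stage functions over a list of vehicle samples. It is not the paper's implementation: the `Sample` fields, the `MAX_SPEED` bound, and all function names are hypothetical. The understanding stage is reduced to time-ordering, cleaning to a single plausibility rule, and alignment to a simple check that the displacement between consecutive samples is physically reachable. Graph-guided augmentation and active-learning annotation are left as a pass-through placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Sample:
    t: float                      # timestamp in seconds
    x: float                      # position along the route in metres
    speed: float                  # reported speed in m/s
    label: Optional[str] = None   # annotation, if any


Stage = Callable[[List[Sample]], List[Sample]]

MAX_SPEED = 70.0  # m/s; hypothetical plausibility bound, not from the paper


def understand(samples: List[Sample]) -> List[Sample]:
    # Stand-in for data understanding: put samples in time order.
    return sorted(samples, key=lambda s: s.t)


def clean(samples: List[Sample]) -> List[Sample]:
    # Rule-based cleaning: drop negative or implausibly high speeds.
    return [s for s in samples if 0.0 <= s.speed <= MAX_SPEED]


def augment(samples: List[Sample]) -> List[Sample]:
    # Placeholder: augmentation and annotation are not sketched here.
    return list(samples)


def align(samples: List[Sample]) -> List[Sample]:
    # Spatiotemporal consistency: keep a sample only if the speed implied
    # by its displacement from the last kept sample is plausible.
    kept: List[Sample] = []
    for s in samples:
        if kept:
            dt = s.t - kept[-1].t
            if dt <= 0 or abs(s.x - kept[-1].x) / dt > MAX_SPEED:
                continue
        kept.append(s)
    return kept


def run_pipeline(
    samples: List[Sample],
    stages: Sequence[Stage] = (understand, clean, augment, align),
) -> List[Sample]:
    # Apply each stage in order; every stage maps a list to a new list.
    for stage in stages:
        samples = stage(samples)
    return samples


raw = [
    Sample(t=2.0, x=20.0, speed=10.0),
    Sample(t=0.0, x=0.0, speed=10.0),
    Sample(t=1.0, x=10.0, speed=-5.0),    # violates the speed rule
    Sample(t=3.0, x=5000.0, speed=10.0),  # position jump is unreachable
    Sample(t=4.0, x=40.0, speed=10.0),
]
result = run_pipeline(raw)
```

Because every stage shares one signature, a project can swap or reorder stages without touching `run_pipeline`, which is one plausible way to obtain the kind of reuse the summary describes.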