Textual semantics and machine learning methods for data product pricing

📅 2025-11-27
🤖 AI Summary
This study addresses the underuse of semantic features in data product pricing by proposing a dual-path pricing framework that integrates textual semantics with machine learning. It systematically compares five text representation techniques: Word2Vec, Bag-of-Words (BoW), TF-IDF, GloVe, and BERT. It also evaluates six predictive models, including linear regression, XGBoost, and neural networks, on two complementary tasks: continuous price prediction and price classification. The framework incorporates minimum-redundancy-maximum-relevance (mRMR) feature selection and SHAP-based interpretability analysis. Results show that Word2Vec achieves the highest regression accuracy, whereas BoW and TF-IDF perform best for classification; semantic features exhibit strong explanatory power for price trends. The key contributions are: (1) the first empirical demonstration that text representations differ in effectiveness across pricing tasks, and (2) SHAP-based attribution analysis that improves the interpretability of data product pricing models, providing both theoretical foundations and practical tools for market-oriented pricing of data as a production factor.

📝 Abstract
Reasonable pricing of data products enables data trading platforms to maximize revenue and foster the growth of the data trading market. The textual semantics of data products are vital for pricing and contain significant value that remains largely underexplored. To investigate how textual features influence data product pricing, we employ five prevalent text representation techniques to encode the descriptive text of data products. We then apply six machine learning methods to predict data product prices: linear regression, neural networks, decision trees, support vector machines, random forests, and XGBoost. Our empirical design consists of two tasks: a regression task that predicts the continuous price of data products, and a classification task that discretizes price into ordered categories. Furthermore, we conduct feature importance analysis using the mRMR feature selection method and SHAP-based interpretability techniques. Based on empirical data from the AWS Data Exchange, we find that for predicting continuous prices, Word2Vec text representations, which capture semantic similarity, yield superior performance. In contrast, for price-tier classification tasks, simpler representations that do not rely on semantic similarity, such as Bag-of-Words and TF-IDF, perform better. SHAP analysis reveals that semantic features related to healthcare and demographics tend to increase prices, whereas those associated with weather and environmental topics are linked to lower prices. This analytical framework significantly enhances the interpretability of pricing models.
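The two-task design described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the descriptions and prices are made up, the TF-IDF weighting is the plain `tf * log(N / df)` variant, and the equal-frequency binning in `price_tiers` is an assumption, since the abstract does not say how prices are discretized. The helper names `tfidf` and `price_tiers` are hypothetical.

```python
import math
from collections import Counter

# Toy, made-up product descriptions and prices (not from the paper's dataset).
descriptions = [
    "hospital patient claims data",
    "daily weather station data",
    "patient demographics survey",
    "weather radar images",
]
prices = [900.0, 50.0, 600.0, 80.0]

def tfidf(docs):
    """Return (vocabulary, matrix) using plain tf * log(N / df) weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of descriptions that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, matrix

def price_tiers(values, n_tiers=2):
    """Map each price to an ordered tier 0..n_tiers-1 by rank (equal-frequency bins).

    Assumption: the binning rule is illustrative; the source does not specify one.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    tiers = [0] * len(values)
    for rank, i in enumerate(order):
        tiers[i] = rank * n_tiers // len(values)
    return tiers

vocab, X = tfidf(descriptions)
y_regression = prices                   # continuous target for the regression task
y_classification = price_tiers(prices)  # ordered tier target for the classification task
```

The same feature matrix `X` feeds both tasks; only the target changes. Any of the six model families named in the abstract could then be fitted to `(X, y_regression)` or `(X, y_classification)`.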
Problem

Research questions and friction points this paper is trying to address.

Investigates textual semantics' impact on data product pricing
Uses machine learning to predict prices via regression and classification
Enhances pricing model interpretability with feature importance analysis
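The mRMR idea behind the feature importance analysis can be sketched as a greedy loop: pick the feature most relevant to the price, then repeatedly add the feature whose relevance minus its average redundancy with the already-selected features is largest. The sketch below is an assumption-laden simplification: it uses absolute Pearson correlation in place of mutual information, the data are made up, and the names `pearson` and `mrmr` are hypothetical rather than taken from the paper.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def mrmr(columns, target, k):
    """Greedily select k feature indices by relevance minus mean redundancy."""
    relevance = [abs(pearson(col, target)) for col in columns]
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                abs(pearson(columns[j], columns[s])) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up columns: f1 is a noisy copy of f0, f2 carries a weaker but independent signal.
f0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f1 = [1.1, 1.9, 3.1, 3.9, 5.1, 5.9]
f2 = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]
price = [2.0 * a + b for a, b in zip(f0, f2)]

chosen = mrmr([f0, f1, f2], price, k=2)
```

Here `f1` is more correlated with the price than `f2`, yet the second pick is `f2`, because `f1` is almost entirely redundant with the already-selected `f0`. Ranking by relevance alone would have kept both near-duplicates.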
Innovation

Methods, ideas, or system contributions that make the work stand out.

Word2Vec captures semantic similarity for continuous price prediction
Bag-of-Words and TF-IDF used for price-tier classification tasks
SHAP analysis interprets feature impact on data product pricing
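To show what a SHAP attribution looks like, the sketch below uses the one case with a simple closed form: for a linear model with independent features, the SHAP value of feature i is w_i * (x_i - mean(x_i)). Everything in it is an illustrative assumption: the topic features, weights, and background rows are invented, with signs chosen only to echo the direction reported in the abstract. It does not use the `shap` library and does not reproduce the paper's models.

```python
# Hypothetical linear price model over three made-up topic features.
feature_names = ["healthcare", "demographics", "weather"]
weights = [400.0, 250.0, -150.0]  # invented coefficients, not estimates from the paper
intercept = 300.0

# Made-up background data used to estimate each feature's mean.
background = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

def predict(x):
    """Linear model prediction: intercept plus weighted sum of features."""
    return intercept + sum(w * v for w, v in zip(weights, x))

def linear_shap(x):
    """SHAP values of a linear model under feature independence: w_i * (x_i - mean_i)."""
    n = len(background)
    means = [sum(row[i] for row in background) / n for i in range(len(weights))]
    return [w * (v - m) for w, v, m in zip(weights, x, means)]

x = [1.0, 0.0, 1.0]  # a hypothetical product tagged healthcare and weather
phi = linear_shap(x)
baseline = sum(predict(row) for row in background) / len(background)

for name, value in zip(feature_names, phi):
    print(f"{name}: {value:+.1f}")
```

The attributions sum to the gap between this product's predicted price and the average prediction over the background data, which is the property that makes per-feature contributions comparable. Non-linear models such as XGBoost need the estimators provided by a SHAP implementation instead of this closed form.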
Ruize Gao
Beijing Institute of Mathematical Sciences and Applications, Beijing, China
Feng Xiao
Beijing Institute of Mathematical Sciences and Applications, Beijing, China
Jinpu Li
Tsinghua University, Beijing, China
Shaoze Cui
Beijing Institute of Technology
business analytics, machine learning, data mining, predictive modeling