Linking Data Citation to Repository Visibility: An Empirical Study

📅 2025-06-11
🤖 AI Summary
This study investigates how repository-level web visibility relates to dataset citation rates. Using OpenAlex data, we combine Sistrix domain visibility scores, repository h-indices, and citation metrics (mean and median citations) to analyze correlations for datasets in the social sciences and economics. We find that domain-level web visibility, measured via search engine metrics, is positively associated with dataset citation counts, and that the association is most pronounced for datasets cited at least once; datasets hosted on highly visible domains receive, on average, more citations. In contrast, domain-level bibliometric indicators (h-index, mean, and median citations) show weaker and inconsistent correlations with visibility. Our findings indicate that search visibility contributes to dataset citation but is not the sole factor: dataset quality, research trends, and disciplinary norms also shape citation patterns.

📝 Abstract
In today's data-driven research landscape, dataset visibility and accessibility play a crucial role in advancing scientific knowledge. At the same time, data citation is essential for maintaining academic integrity, acknowledging contributions, validating research outcomes, and fostering scientific reproducibility. As a critical link, it connects scholarly publications with the datasets that drive scientific progress. This study investigates whether repository visibility influences data citation rates. We hypothesize that repositories with higher visibility, as measured by search engine metrics, are associated with increased dataset citations. Using OpenAlex data and repository impact indicators (including the visibility index from Sistrix, the h-index of repositories, and citation metrics such as mean and median citations), we analyze datasets in Social Sciences and Economics to explore their relationship. Our findings suggest that datasets hosted on more visible web domains tend to receive more citations, with a positive correlation observed between web domain visibility and dataset citation counts, particularly for datasets with at least one citation. However, when analyzing domain-level citation metrics, such as the h-index, mean, and median citations, the correlations are inconsistent and weaker. While higher visibility domains tend to host datasets with greater citation impact, the distribution of citations across datasets varies significantly. These results suggest that while visibility plays a role in increasing citation counts, it is not the sole factor influencing dataset citation impact. Other elements, such as dataset quality, research trends, and disciplinary norms, also contribute significantly to citation patterns.
Problem

Research questions and friction points this paper is trying to address.

Investigates if repository visibility affects data citation rates
Explores correlation between web domain visibility and dataset citations
Examines factors beyond visibility influencing dataset citation impact
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines OpenAlex citation data with Sistrix web visibility scores to analyze repository visibility
Links web domain visibility to citation counts
Examines multiple citation metrics for datasets
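
The analysis outlined above can be illustrated with a minimal sketch. It computes the per-domain metrics named in the abstract (h-index, mean, and median citations) and rank-correlates each with a visibility score, once for all datasets and once for datasets with at least one citation. The abstract does not say which correlation coefficient was used, so Spearman's rank correlation is an assumption here. The function names, the `DOMAINS` table, and every number in it are hypothetical and invented for illustration; none of them come from the study.

```python
from statistics import mean, median


def h_index(citations):
    """Largest h such that h datasets have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break
    return h


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the tie
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (var_x * var_y) ** 0.5


def domain_metrics(domains, cited_only=False):
    """Per-domain visibility, h-index, mean and median citations.

    With cited_only=True, datasets with zero citations are dropped first.
    """
    rows = []
    for name, (visibility, citations) in domains.items():
        kept = [c for c in citations if c >= 1] if cited_only else list(citations)
        if not kept:
            continue  # nothing left to summarize for this domain
        rows.append({
            "domain": name,
            "visibility": visibility,
            "h_index": h_index(kept),
            "mean": mean(kept),
            "median": median(kept),
        })
    return rows


# Invented illustration data: domain -> (visibility score, citation counts).
DOMAINS = {
    "repo-a.example": (0.2, [0, 0, 1]),
    "repo-b.example": (1.5, [0, 2, 3, 1]),
    "repo-c.example": (4.0, [5, 0, 7, 2, 2]),
}

if __name__ == "__main__":
    for cited_only in (False, True):
        rows = domain_metrics(DOMAINS, cited_only)
        visibility = [r["visibility"] for r in rows]
        for metric in ("h_index", "mean", "median"):
            rho = spearman(visibility, [r[metric] for r in rows])
            print(f"cited_only={cited_only} {metric}: rho={rho:.2f}")
```

Running the correlation twice, with and without uncited datasets, mirrors the abstract's distinction between all datasets and those with at least one citation. With only three toy domains the coefficients carry no statistical meaning; the sketch shows the shape of the computation only.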
Fakhri Momeni
Knowledge Technologies for the Social Sciences (KTS), GESIS - Leibniz Institute, Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Janete Saldanha Bach
Knowledge Technologies for the Social Sciences (KTS), GESIS - Leibniz Institute, Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Brigitte Mathiak
GESIS - Leibniz Institute for the Social Sciences
Information Retrieval
Peter Mutschke
Knowledge Technologies for the Social Sciences (KTS), GESIS - Leibniz Institute, Unter Sachsenhausen 6-8, 50667 Cologne, Germany