Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing

📅 2026-04-25
📈 Citations: 0
Influential: 0
📄 PDF

career value

179K/year
🤖 AI Summary
This study addresses the challenge of legally sharing copyrighted annotated corpora, which hinders natural language processing models from capturing the full diversity of real-world data. To overcome this limitation, the authors propose a corpus distribution mechanism based on non-invertible hashing: corpus creators publicly release hashed versions of both source text and annotations, enabling users to recover the annotations by aligning their own licensed copies of the text through the same hash function. The approach integrates text alignment algorithms with version-tolerant strategies to achieve robust cross-version matching and is implemented in an open-source Python toolkit named novelshare. Experiments on multiple editions of novels demonstrate token-level alignment accuracy ranging from 98.7% to 99.79%, confirming the method’s efficiency and practical utility.

Technology Category

Application Category

📝 Abstract
While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully and publicly share the annotations of copyrighted literary texts. The corpus creator shares the annotations in clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material, and apply the same hash function to their own tokens, in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show that our method is able to correctly align 98.7 to 99.79% of tokens depending on the novel, provided the user version is sufficiently close to the corpus creator's version. We publicly release novelshare, a Python implementation of our method.
Problem

Research questions and friction points this paper is trying to address.

copyright
corpus distribution
annotated corpora
NLP
data sharing
Innovation

Methods, ideas, or system contributions that make the work stand out.

non-reversible hashing
copyright-compliant corpus sharing
annotation alignment
NLP data distribution
text version robustness
🔎 Similar Papers
2024-06-17North American Chapter of the Association for Computational LinguisticsCitations: 2