Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

career value

150K/year

🤖 AI Summary

This study addresses the challenge of legally sharing copyrighted annotated corpora, which hinders natural language processing models from capturing the full diversity of real-world data. To overcome this limitation, the authors propose a corpus distribution mechanism based on non-invertible hashing: corpus creators publicly release hashed versions of both source text and annotations, enabling users to recover the annotations by aligning their own licensed copies of the text through the same hash function. The approach integrates text alignment algorithms with version-tolerant strategies to achieve robust cross-version matching and is implemented in an open-source Python toolkit named novelshare. Experiments on multiple editions of novels demonstrate token-level alignment accuracy ranging from 98.7% to 99.79%, confirming the method’s efficiency and practical utility.

Technology Category

Application Category

📝 Abstract

While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully and publicly share the annotations of copyrighted literary texts. The corpus creator shares the annotations in clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material, and apply the same hash function to their own tokens, in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show that our method is able to correctly align 98.7 to 99.79% of tokens depending on the novel, provided the user version is sufficiently close to the corpus creator's version. We publicly release novelshare, a Python implementation of our method.

Problem

Research questions and friction points this paper is trying to address.

corpus distribution

annotated corpora

NLP

data sharing

Innovation

Methods, ideas, or system contributions that make the work stand out.

non-reversible hashing

annotation alignment

NLP data distribution

text version robustness