🤖 AI Summary
This study addresses the difficulty of legally sharing annotated corpora built on copyrighted text, a restriction that keeps such corpora from representing the full diversity of real-world data in natural language processing. The authors propose a distribution mechanism based on non-reversible hashing: the corpus creator publicly releases the annotations in clear form together with a hashed version of the source text, and users match their own tokens to those annotations by applying the same hash function to their own copy of the text. The method tolerates reasonable differences between editions and is implemented in a publicly released Python package named novelshare. Alignment experiments on multiple editions of novels show that 98.7% to 99.79% of tokens are aligned correctly, depending on the novel, provided the user's edition is sufficiently close to the corpus creator's.
📝 Abstract
While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet such corpora are necessary to fully represent the diversity of data found in the wild in NLP tasks. We tackle this issue by proposing a method to lawfully and publicly share the annotations of copyrighted literary texts. The corpus creator shares the annotations in the clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material and apply the same hash function to their own tokens in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show that our method correctly aligns 98.7% to 99.79% of tokens, depending on the novel, provided the user's version is sufficiently close to the corpus creator's version. We publicly release novelshare, a Python implementation of our method.
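The hash-and-match idea described in the abstract can be illustrated with a minimal sketch. This is not the novelshare API: the function names `hash_token` and `recover_annotations`, the choice of SHA-256, and the use of the standard library's `difflib.SequenceMatcher` as the alignment step are all assumptions made for illustration, and the toy tokens and labels are invented.

```python
import hashlib
from difflib import SequenceMatcher


def hash_token(token: str) -> str:
    """Hash one token with SHA-256 (an assumed choice of hash function)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def recover_annotations(shared_hashes, shared_labels, user_tokens):
    """Project shared labels onto the user's tokens by aligning hash sequences.

    Returns one label per user token, or None where no match was found.
    Hypothetical helper: difflib stands in for the paper's alignment method.
    """
    user_hashes = [hash_token(t) for t in user_tokens]
    matcher = SequenceMatcher(a=shared_hashes, b=user_hashes, autojunk=False)
    labels = [None] * len(user_tokens)
    # Each matching block is a run of identical hashes in both sequences.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            labels[block.b + k] = shared_labels[block.a + k]
    return labels


# Corpus creator side: publish hashes and labels, never the tokens themselves.
creator_tokens = ["Alice", "met", "the", "White", "Rabbit", "today"]
creator_labels = ["B-PER", "O", "O", "B-PER", "I-PER", "O"]
shared_hashes = [hash_token(t) for t in creator_tokens]

# Corpus user side: a slightly different edition with one extra token.
user_tokens = ["Alice", "met", "the", "old", "White", "Rabbit", "today"]
recovered = recover_annotations(shared_hashes, creator_labels, user_tokens)

for token, label in zip(user_tokens, recovered):
    print(f"{token}\t{label}")
```

In this toy case the extra token in the user's edition receives no label, while the tokens on either side of it still pick up the shared annotations, which is the kind of tolerance to edition differences the abstract refers to.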