🤖 AI Summary
Existing research on life trajectories focuses mainly on coarse-grained events such as birth and death. This study addresses that gap by introducing a large-scale, fine-grained human life trajectory dataset. The authors extract 3.8 million (person, time, location) triples from Wikipedia and use the surrounding text to classify each triple into one of 24 activity types. Methodologically, they propose a classification framework that fuses syntactic graphs with text embeddings, and they use large language models to refine the original text so that the resulting syntactic graphs are more standardized. The framework achieves 84.5% accuracy on the 24-class activity classification task, outperforming the baselines. Both the dataset and the source code are publicly available.
📝 Abstract
Life trajectories of notable people carry essential information for human dynamics research. These trajectories consist of (person, time, location, activity type) tuples recording when and where a person was born, went to school, started a job, or fought in a war. However, current studies cover only a limited set of activity types, such as births and deaths, and lack large-scale fine-grained trajectories. Using a tool that extracts (person, time, location) triples from Wikipedia, we formulate the problem of classifying these triples into 24 carefully defined types, using textual context as complementary information. The challenge is that triple entities are often scattered in noisy contexts. We use syntactic graphs to bring triple entities and relevant information closer, fusing them with text embeddings to classify life trajectory activities. Since Wikipedia text quality varies, we use LLMs to refine the text for more standardized syntactic graphs. Our framework achieves 84.5% accuracy, surpassing baselines. We construct the largest fine-grained life trajectory dataset, with 3.8 million labeled activities for 589,193 individuals spanning three centuries. Finally, we showcase how these trajectories can support grand narratives of human dynamics across time and space. The code and data are publicly available.
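The intuition behind using syntactic graphs can be illustrated with a minimal sketch. It is not the authors' implementation: the sentence, its tokenization, and the dependency edges are hand-written assumptions, and the function names `build_adjacency` and `graph_distance` are hypothetical. The sketch only compares how far apart the triple entities are in linear token order versus in a dependency graph; it does not cover the fusion with text embeddings or the classifier.

```python
from collections import deque

# Toy sentence. The tokenization and the dependency edges below are
# hand-written assumptions for illustration, not output of the authors' tool.
TOKENS = ["Einstein", ",", "who", "had", "studied", "physics", "in",
          "Zurich", ",", "moved", "to", "Bern", "in", "1902"]

# (head index, dependent index) pairs of a hypothetical dependency parse.
EDGES = [
    (9, 0),                          # moved -> Einstein (subject)
    (0, 4),                          # Einstein -> studied (relative clause)
    (4, 2), (4, 3), (4, 5), (4, 7),  # studied -> who, had, physics, Zurich
    (7, 6),                          # Zurich -> in
    (9, 11),                         # moved -> Bern
    (11, 10),                        # Bern -> to
    (9, 13),                         # moved -> 1902
    (13, 12),                        # 1902 -> in
    (0, 1), (0, 8),                  # commas attached to Einstein
]


def build_adjacency(num_tokens, edges):
    """Treat dependency arcs as undirected edges between token indices."""
    adjacency = {i: set() for i in range(num_tokens)}
    for head, dependent in edges:
        adjacency[head].add(dependent)
        adjacency[dependent].add(head)
    return adjacency


def graph_distance(adjacency, source, target):
    """Breadth-first search: edges on the shortest path, or None if unreachable."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


ADJACENCY = build_adjacency(len(TOKENS), EDGES)
PERSON, LOCATION, TIME = 0, 11, 13  # Einstein, Bern, 1902

for name, idx in [("location", LOCATION), ("time", TIME)]:
    linear = abs(idx - PERSON)
    syntactic = graph_distance(ADJACENCY, PERSON, idx)
    print(f"person -> {name}: {linear} tokens apart, {syntactic} graph hops")
```

In this toy parse the person is 11 and 13 tokens away from the location and the time in linear order, but only 2 hops away from each in the graph, because all three attach to the verb "moved". The intervening relative clause about Zurich no longer separates them.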