News Feature: Sharing Research Data

November 08, 2022, #20

At the ASIS&T Conference in Pittsburgh, Pennsylvania, USA, Sara Lafia presented “A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature” on behalf of her co-authors Lizhou Fan and Libby Hemphill, all from the University of Michigan. Their research uses “natural language processing”¹ to link “thousands of social science studies to the data-related publications in which they are used.”¹ They used their own “Entity Recognition (NER) model”¹ to detect informal references with the goal of “connecting items from social science literature with datasets they reference.”¹ Their main source of data was ICPSR (the Inter-university Consortium for Political and Social Research), which holds data going back to its founding in 1962.

They begin by parsing “full text PDFs into structured text documents”¹, which “retains headers, section titles, tables, figures, and footnotes where data references are likely to be found.”¹ They then apply their NER model “to detect dataset entities”¹. As Sara Lafia noted, finding these references by hand is very labour-intensive work, and the automated approach makes good use of the thousands of ICPSR datasets. The process lets them “address gaps caused by inconsistent data citation practices and make it possible to detect and analyse data references at scale.”¹ They use human-in-the-loop feedback to improve accuracy. In the end they will have created a substantial bibliography of “informal references to research data”¹ that could otherwise easily be overlooked.

In the paper’s conclusions, the authors make some key observations, including that “awareness of how users interact with data throughout its lifecycle can support the development of data-driven curation and collection policies.”¹ Developing such policies may seem straightforward, but in fact we have far too little information about how users use data in large-scale archives like ICPSR.

They also note that better metrics will not only “better represent the diversity of data use”¹ but “may also inspire novel reuse in the long-term”¹. It is easy to underestimate the importance of this reuse. As the scholarly community increasingly emphasises the value of long-term access to research data, an understanding of what the data are and how they have been used is invaluable.
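
For readers who want a concrete picture of the steps described above (scan the structured text section by section, flag likely dataset references, and send uncertain ones to a person), here is a minimal sketch in Python. It is not the authors' NER model: it swaps in a simple alias lookup with regular expressions, and the dataset names, the `ALIASES` table, the `Mention` class, and the `detect_mentions` function are all hypothetical, invented for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class Mention:
    section: str        # part of the document where the match was found
    text: str           # the matched surface form
    dataset: str        # canonical dataset name the surface form maps to
    needs_review: bool  # True when a human should confirm the match


# Hypothetical alias table (assumption): canonical name -> informal surface forms.
# A trained NER model would replace this lookup in a real pipeline.
ALIASES = {
    "Example Household Panel": ["Example Household Panel", "EHP"],
    "Sample Voter Study": ["Sample Voter Study", "SVS"],
}


def detect_mentions(document):
    """Scan each section of an already-parsed document for dataset aliases."""
    mentions = []
    for section, text in document.items():
        for dataset, aliases in ALIASES.items():
            for alias in aliases:
                pattern = r"\b" + re.escape(alias) + r"\b"
                for match in re.finditer(pattern, text):
                    # Short acronyms are ambiguous, so flag them for human review.
                    flagged = len(alias) <= 4
                    mentions.append(Mention(section, match.group(), dataset, flagged))
    return mentions


# A toy "structured text document": section name -> text (assumed shape).
document = {
    "body": "We analyse responses from the Example Household Panel.",
    "footnotes": "Turnout figures come from the SVS.",
}

mentions = detect_mentions(document)
for mention in mentions:
    print(mention)
```

Keeping footnotes as their own section matters here for the same reason the article gives: informal references often sit outside the main text, and the `needs_review` flag stands in, very loosely, for the human-in-the-loop check.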


1: Lafia, Sara, Lizhou Fan, and Libby Hemphill. 2022. ‘A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature’. Proceedings of the Association for Information Science and Technology 59 (1): 169–78.

