Project’s quantitative data available online
A main objective of our project is to combine oral histories, archival searches and other traditional methods of historical research with quantitative data, so that we can develop a new approach to the study of the history of genomics. Our quantitative data is now freely available online, after more than two years of compiling, cleaning and structuring over 13 million records. The dataset documents the institutions that submitted yeast, human and pig DNA sequences to the European Nucleotide Archive and other open access databases between 1980 and 2015, indicating for each institution both the number of submissions and the number of submitted nucleotides per year. It also lists the PubMed ID, authors and publication year of the articles that first describe these sequences in the scientific literature. The source code of the software we used to compile the data can also be downloaded without restrictions.
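To give a sense of what "submissions and submitted nucleotides per institution per year" means in practice, here is a minimal Python sketch of that kind of aggregation. The record layout, the institution names and the numbers are invented for illustration; they are assumptions, not the actual structure or contents of our dataset or of the software we released.

```python
from collections import defaultdict

# Hypothetical records: (institution, submission year, nucleotides submitted).
# These values are invented for illustration and are not taken from the dataset.
records = [
    ("Institute A", 1996, 1200),
    ("Institute A", 1996, 800),
    ("Institute A", 1997, 500),
    ("Institute B", 1996, 3000),
]


def summarise(records):
    """Count submissions and sum nucleotides per (institution, year)."""
    totals = defaultdict(lambda: {"submissions": 0, "nucleotides": 0})
    for institution, year, nucleotides in records:
        entry = totals[(institution, year)]
        entry["submissions"] += 1
        entry["nucleotides"] += nucleotides
    return dict(totals)


summary = summarise(records)
for (institution, year), entry in sorted(summary.items()):
    print(institution, year, entry["submissions"], entry["nucleotides"])
```

Keying the totals on the (institution, year) pair keeps both measures side by side, which is how the paragraph above describes them.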
The data collection process involved 30 million automated searches in the European Nucleotide Archive, Europe PubMed Central and Scopus. We needed to interlink the search results in a new way to create the datasets we sought: 13.4 million sequence submissions and 28,328 publications. More than 75% of these records relate to human sequencing. A data note describing the search strategy, cleaning protocol, design and structure of the dataset has been published on F1000Research, an open access and open peer review platform for the life sciences.
In combination with qualitative historical knowledge, the dataset allows the identification of overlooked individuals and institutions in the history of genomics, as well as previously unknown connections between them. We are now analysing a number of co-authorship networks we have derived from the data, with a view to publishing the results in a history of science journal.
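For readers curious about how a co-authorship network can be derived from a list of publications, the following Python sketch shows one common approach: every pair of authors who share an article becomes a link, weighted by the number of articles they share. The field names and author names are hypothetical, and this is an illustration of the general idea rather than the method used in our analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical publication records; identifiers and names are invented.
publications = [
    {"pubmed_id": "P1", "authors": ["Author X", "Author Y", "Author Z"]},
    {"pubmed_id": "P2", "authors": ["Author X", "Author Y"]},
    {"pubmed_id": "P3", "authors": ["Author Z"]},
]


def coauthorship_edges(publications):
    """Return a Counter mapping each author pair to its number of shared articles."""
    edges = Counter()
    for publication in publications:
        # Sort and deduplicate so each pair is counted once per article
        # and always appears in the same order.
        authors = sorted(set(publication["authors"]))
        for pair in combinations(authors, 2):
            edges[pair] += 1
    return edges


edges = coauthorship_edges(publications)
for (first, second), weight in sorted(edges.items()):
    print(first, "-", second, weight)
```

Single-author articles contribute no links, and real author lists would first need name disambiguation, which this sketch leaves out.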