An international group of researchers has created a set of guidelines for creating, referencing, and maintaining web-based identifiers, with the aim of improving the reproducibility, attribution and scientific discovery of “big data”. According to the researchers, this type of data currently risks being undermined by the poor design of the digital identifiers that tag it. The group, led by Julie McMurry from Oregon Health & Science University, has published its findings in the open access journal PLOS Biology, in a paper that addresses frequent problems associated with persistent identifiers linked to scientific data.

As data has become larger, more interdependent and natively web-based, the scientific research community has struggled to engineer it for the web so that it is constantly accessible, reusable and attributable.

Depending on the individual database involved, identifiers can signify a gene, a genome, a chemical, an organism, a set of experimental data, or even a published article. The usefulness of all these items depends on the robustness and uniqueness of their respective identifiers, enabling them to be linked and discovered in perpetuity. The authors have explained that the organic way in which most identifiers have arisen threatens that usefulness, and recognise that it is difficult to create and sustain persistent identifiers or web addresses that won’t break and that are used consistently.

In short, professionals need to do a better job of identifier engineering, so that data can be utilised more effectively for scientific discovery. In addition, users must become more aware of these conventions, and of available tooling, so that they are not burdened by broken links and missed connections.

“As with plumbing fixtures, the question of how identifiers work should only need to be understood by those that build and maintain them. However, everyone needs to know how identifiers should be used, and this is where convention is important,” explained McMurry. “Through this work, we hope to encourage all participants in the scholarly ecosystem – including authors, data creators, data integrators, publishers, software developers, and resolvers – to adhere to best practice in order to maximize the utility and impact of life science data.”