This article features in our Festival of Genomics 2019 event guide. Grab a guide while at our Festival for more fantastic articles and interviews from leading life science experts, including Angela Douglas MBE!

Data integration has been one of the major trends of the last few years, and one that will become ever more important as the life sciences progress. Successful integration allows a user to view and understand data sitting in two (or more) different sources or formats. But why is this important? And what challenges remain for those seeking to integrate their structured and unstructured data? Dr. Maya Ghoussaini, Genetic Analysis Team Leader at the Wellcome Sanger Institute, and Dr. Denise Carvalho-Silva, Scientific Outreach Lead at EMBL-EBI, both of whom work at Open Targets, discuss here the importance of this process and their own work within the realm of data integration.

Recent technological advances have led to an unprecedented increase in the amount of biological data. Individually, each dataset provides a unique, independent piece of information, but when integrated, these datasets reveal findings and biological links that would otherwise remain hidden, bringing valuable insights into health and disease research. Examples of this data include sequencing and genotyping read-outs, gene and protein expression profiles across healthy and diseased tissues and/or cell types, CRISPR screens and other functional genomic information, drug compounds, and last but certainly not least, pre-clinical and clinical data.

The Importance of Data Integration

Data integration allows us to answer key biological questions and gain profound insights into disease biology, relevant pathways and key biological systems. For instance, multi-omics data is crucial to constructing robust complex networks and to modelling biochemical systems. Similarly, integrating single-cell transcriptomic data instead of the traditional bulk-cell transcriptomic data helps identify the sub-populations of cells that are most relevant to diseases. It can also lead to the discovery of genes and pathways that govern cell fate decisions and transitions. Integrating data from different sources also helps to validate and replicate results across those sources, improving accuracy and precision.

Finally, integrating clinical data alongside genetic, genomic, metabolomic, transcriptomic and proteomic data in addition to lifestyle and environmental data represents a key step towards precision medicine. It also allows us to build new tools and predictive models to benefit patients individually.

Data Integration in Pharma

Drug development continues to be a highly inefficient, expensive and lengthy process, and identifying new potential drug targets and developing them into safe and effective medicine is a key priority for pharmaceutical companies.

Many pharmaceutical companies are closely partnering with academic researchers to integrate evidence coming from genome-scale experiments to help inform decision-making and adoption of new targets in drug development pipelines. The Open Targets initiative is one example of such collaboration where data is generated and integrated from whole genome experiments and data sources to identify and prioritise drug targets. The evidence includes germline and somatic mutations, transcriptomics data, clinical trials data and approved drugs, animal models, biochemical pathways and text mining from the literature. It is also important to emphasise that pharmaceutical companies are increasingly integrating genetics and functional genomics data into their drug development pipelines after robust evidence showed that genetically-supported drug targets are more likely to have drugs that succeed in the clinic.

In addition to discovering new medicines, platforms with comprehensive integrated data can be used to discover new uses for medicines that are already approved and marketed. This approach offers an attractive alternative to traditional drug discovery, as it uses compounds already proven to be safe, at much lower development cost and on shorter development timelines. The abundance of genetic data, such as human knock-out genes from exome and whole-genome sequencing experiments, genome-wide association data, gene expression data, and interaction data between drugs, diseases and genes, offers new opportunities to understand shared disease biology and to generate new associations between diseases and existing drugs.
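As a rough illustration of how linked drug, gene and disease data can surface new drug-disease associations, the toy sketch below joins two tables through a shared target. It is not the method used by Open Targets or any pharmaceutical company: the function name, the data layout and every drug, gene and disease name are hypothetical placeholders. A real analysis would also weigh the strength and direction of each piece of evidence.

```python
from collections import defaultdict


def repurposing_candidates(drug_targets, target_diseases, known_indications):
    """Return (drug, disease) pairs linked through a shared target.

    Pairs already listed in known_indications are skipped, so only
    potentially new associations are returned.
    """
    # Index diseases by the target they are associated with.
    diseases_by_target = defaultdict(set)
    for target, disease in target_diseases:
        diseases_by_target[target].add(disease)

    candidates = set()
    for drug, target in drug_targets:
        for disease in diseases_by_target.get(target, ()):
            if (drug, disease) not in known_indications:
                candidates.add((drug, disease))
    return sorted(candidates)


# Hypothetical placeholder data, not real drugs, genes or diseases.
drug_targets = [("drug_A", "GENE1"), ("drug_B", "GENE2")]
target_diseases = [
    ("GENE1", "disease_X"),
    ("GENE1", "disease_Y"),
    ("GENE2", "disease_Y"),
]
known_indications = {("drug_A", "disease_X")}

candidates = repurposing_candidates(drug_targets, target_diseases, known_indications)
for drug, disease in candidates:
    print(f"{drug} -> {disease}")
```

The join only works because both tables name the target in the same way, which is one reason shared identifiers matter so much in practice.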

Finally, data integration can also help improve clinical trial efficiency through targeting a subset of patients based on genetic and genomic knowledge, thereby enabling smaller, shorter and less expensive trials.

The Challenges of Data Integration

What should you know before integrating biological data into a single resource?

Over the last three years, Open Targets has launched several resources and tools, including our flagship resources, the Open Targets Platform and Open Targets Genetics. We use human genetics and genomics data for systematic drug target identification and prioritisation, and these are some of the criteria we take into account before integrating new data:

—Relevance of the data

We assess how the new data can be used to associate targets with diseases, if it can suggest a causal link between a target and a disease and/or if it can allow drug target prioritisation depending on specific properties of the target.

—Seamless integration of the data

We assess how easily the data can be integrated into our existing pipelines, and therefore favour data that is described using open standard ontologies, such as EFO, SO, UBERON and ECO. We also require a standard identifier for the drug target (a protein or gene), such as a UniProt ID or an Ensembl gene ID, which avoids the need for term mapping. Finally, we take into account whether a score is available for the new data, so that we can use it to rank the data points, and how this original score can be embedded into our scoring frameworks.
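A check along these lines can be automated before new data enters a pipeline. The sketch below is a minimal illustration, not the Open Targets pipeline. The record layout (`target_id`, `disease_id`, `score`), the function name and the assumption that scores fall between 0 and 1 are all hypothetical. The identifier patterns follow the commonly documented shapes of human Ensembl gene IDs, UniProt accessions and EFO terms, and should be treated as approximations.

```python
import re

# Identifier shapes (approximations; other species and ontologies differ).
ENSEMBL_GENE = re.compile(r"^ENSG\d{11}$")
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
EFO_TERM = re.compile(r"^EFO_\d{7}$")


def check_record(record: dict) -> list[str]:
    """Return a list of problems that would complicate integration."""
    problems = []

    target = str(record.get("target_id", ""))
    if not (ENSEMBL_GENE.match(target) or UNIPROT_ACCESSION.match(target)):
        problems.append("target_id needs mapping to Ensembl or UniProt")

    disease = str(record.get("disease_id", ""))
    if not EFO_TERM.match(disease):
        problems.append("disease_id is not an EFO term")

    # Assumed convention: a numeric score between 0 and 1 inclusive.
    score = record.get("score")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        problems.append("score is missing or outside the range 0 to 1")

    return problems


# Illustrative records; the score values are made up.
clean = {"target_id": "ENSG00000141510", "disease_id": "EFO_0000270", "score": 0.8}
messy = {"target_id": "TP53", "disease_id": "asthma"}

print(check_record(clean))
print(check_record(messy))
```

The second record uses a gene symbol and a free-text disease name, so each field would need term mapping before the data could be ranked alongside other evidence.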

—Availability of the data

We assess whether the new data is open access and open source, as we can only integrate data that is publicly and freely available. The new data should also be accessible through an API (preferably a REST API) or as files available for download. Either route makes it easier to automate the data integration.
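To show why downloadable files lend themselves to automation, here is a minimal sketch that reads a file dump record by record. It assumes, purely for illustration, that the provider publishes JSON Lines (one JSON object per line). The function name, field names and sample identifiers are made-up placeholders, and an in-memory stream stands in for the downloaded file.

```python
import io
import json
from typing import Iterable, Iterator


def read_json_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines in the dump
            yield json.loads(line)


# Stand-in for a downloaded dump; every value here is a placeholder.
sample_dump = io.StringIO(
    '{"target_id": "ENSG00000000001", "disease_id": "EFO_0000001", "score": 0.4}\n'
    "\n"
    '{"target_id": "ENSG00000000002", "disease_id": "EFO_0000001", "score": 0.9}\n'
)

records = list(read_json_lines(sample_dump))
print(len(records), "records loaded")
```

Because the reader accepts any iterable of text lines, the same function works on an opened file, so a scheduled job could re-run it whenever the provider releases an update.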

—Sustainability of the data

We assess how likely it is that the data will be sustainable in the medium- to long-term, and how frequently it is updated. This ensures that our resources will continue to evolve for years to come.

In Conclusion

Before embarking on a data integration journey, you may want to ask yourself: what will this new resource bring to researchers and users that no other resource currently does? What would be the unique feature(s) of your new resource? Can or should you collaborate with, add value to or improve well-established resources instead of creating your own?

The bioinformatics “resourceome” has been growing at an incredibly fast pace over the years, and it can now suit a myriad of use cases and research domains in biology, medicine and translational research. These are exciting times, and we have only just started tapping into the possibilities that data integration has to offer. As a community, we should begin to consider further collaboration, and to assess critically both the need for newly created integrated platforms and their sustainability.

Maya Ghoussaini will be giving a talk on “Open Targets: Resources for systematic target identification and prioritisation for drug discovery” at 11:30 A.M. on the 24th of January, in theatre 1.

For more from the Wellcome Sanger Institute, come to our Innovation Showcase, where Dr. Fiona Behan is speaking on “Using Genome-wide Synthetic Lethal Screens to Identify Potential Novel Oncogenic Therapeutic Targets” at 11:00 A.M. on the 23rd of January, in the Live Lounge.