Solving Genomic Mysteries
Data is continuously being generated across the whole spectrum of genomics and other –omic fields. The pace of innovation has become so rapid that we are constantly raising new questions to investigate. In addition, we have seen technologies from other data-centric fields find interesting applications within our own work. Our panel of experts goes through the big questions being asked, and even asks some of their own.
Pete White, Rieveschl Professor and Chair of the Department of Biomedical Informatics, University of Cincinnati College of Medicine, & Division Director of Biomedical Informatics at Cincinnati Children’s Hospital:
In these roles, Pete oversees informatics research and resources at both institutions, including leading the academic, educational, data services, technology development, and Research IT missions. As co-director of Cincinnati Children’s Center for Pediatric Genomics, he also serves in a leadership capacity for establishing enterprise-level solutions to genome-based precision medicine.
Mark Gerstein, Co-director of the Yale Computational Biology and Bioinformatics Program:
Mark has published extensively in the scientific literature, with over 400 publications in total, a number of them in prominent venues such as Science, Nature, and Scientific American. His research is focused on bioinformatics, and he is particularly interested in data science and data mining, macromolecular geometry and simulation, and human genome annotation and cancer genomics.
Eran Elhaik, Lecturer, The University of Sheffield:
Eran’s recent work includes the development of the GPS technology that identifies and dates the origin of genomes and promotes a new understanding of cot death and mental disorders.
Asim Siddiqui, CTO, NuMedii:
Dr. Siddiqui has over 20 years of experience in bioinformatics and software engineering. Prior to joining NuMedii as Chief Technology Officer, he served as Vice President of Product Development at Natera where he built Natera’s product management and program management functions. Dr. Siddiqui drove improvements to Natera’s NIPT diagnostic including its cloud strategy and cell free cancer program. Prior to that, he served as the Director of Bioinformatics at Life Technologies where he was awarded the Life Technology Inventor Award for his work on NGS bioinformatics.
FLG: With so much –omic data, real-world data, and electronic health records, what are the realistic short and long-term opportunities to really do something useful with all of it?
Pete White: Make no doubt, the –omic data that has been produced to date has already been transformational for many diseases. For example, we’ve seen great advances in identifying pathogenic variants in thousands of rare genetic disorders, with many being either actionable or providing clues to better diagnostics and biomarkers. However, the long-term opportunities allow us to truly understand disease in all of its molecular and phenotypic complexity. In coming years, we should be able to delve in at the single-cell level across the entirety of a disease course, which will provide us with better ways to model interventions, as well as to predict new therapies. When combined with clinical, environmental, and social data, this information will allow us to model disease in a highly specific way.
Mark Gerstein: In the short term, the goal is to get a better knowledge in genetics by analysing and annotating all the omics data, such as the gene regulatory networks, the functions of noncoding sequences, the relationship between genotypes and the phenotypes, etc. In the meantime, by integrating the real world data and electronic health records, people are working on developing personalised medicine and accurate prediction of potential diseases. In the long term, we would like to know the functions of all the genome components so that we can reliably predict any influence of genetic variants. Furthermore, we might get to know how DNA sequences lead to complex organisms like human beings and the evolutionary relationships between different organisms.
Eran Elhaik: I think that people have long awoken from the illusion that Big Data will provide long awaited answers and realised that, if anything, it raises new questions. Karl Popper essentially predicted this in 1934, claiming that the process of scientific discoveries is not systematic or methodological and requires talent, imagination, and intuition, which brings us back to the human factor and the need to train people who can connect all these data. In the short term we will probably see “Meta-datasets” – the stashing of datasets generated by different groups in a hope that their combined size would compensate for the disadvantages of batch effects and other biases. Personally, I am sceptical about such datasets. In the long term, we will see smart integrative systems that know how to correct for these biases and assimilate multiple inputs to produce clever predictions.
Asim Siddiqui: In the short term, while there is a lot of data, it is poorly organised and structured. You can spend a lot of effort organising and cleaning data without much to show for it, so it is important to carefully define the question and identify tractable opportunities that can be realised quickly. There are numerous publications where, through careful choice of datasets and focused questions, the authors were able to derive a result that validated in the lab. Some of these studies have resulted in the launch of commercial ventures such as our company. The failure rate in drug development is high, so even small improvements can be significant from a patient outcome and financial perspective. In the long term, as we get better at structuring data and gather more of it, new types of analysis become possible and we’ll be able to make more precise informatics predictions and improve the success rate of drugs.
FLG: A lot of this data sits in disparate sets. Without a consistent cohort across all of it, can it only be used in isolation?
PW: The more we can harmonise this data, the more we will be able to learn from it. Initiatives such as the UK Biobank and the All of Us program have realised this notion and have tremendous potential as they move forward. Of course, there are many barriers, mostly self-imposed, that are hindering our ability to develop such a consistent cohort. Historically, we have embraced a culture of distributed innovation, which has been quite successful at solving certain biomedical problems, but has been much less effective at making progress for complex diseases. I think we would be wise as a community to focus more on the social challenges that keep disparate datasets apart, such as the ways in which research is funded and awarded, as well as the considerable role that academic and clinical centres play in data accessibility.
MG: The human genome is so complex that it is regulated at multiple levels. It is essential to integrate different datasets, including genomic, transcriptomic, epigenomic data and even clinical results, to get a better understanding of any research interest. Various innovative algorithms and approaches are being developed to achieve this end, such as network-based methods, multiple-kernel learning, multiple step analysis, etc. In our current work, we are also trying to use the deep Boltzmann machine to interpret different datasets within the same framework.
EE: For the time being – yes, but only until more fundamental work is done to understand which of these datasets represent the “truth” and how to correct for the biases in the other datasets to take advantage of their measurements.
AS: Data can be combined with care. This requires close attention from those with expert knowledge in statistics and genomics and careful curation of the input data sets. This can be a manually intensive process and that’s why, as I said earlier, it is important to define the question carefully so as not to waste effort. Once data has been curated and integrated, it can be used multiple times and that increases its value.
FLG: What kind of impact can Machine Learning have on precision medicine, and how far are we from that?
PW: I would argue that we have solidly been in the Machine Learning era for at least 10 years, but the acceleration of how these approaches have been used in biomedicine is truly remarkable. Machine learning, deep learning, AI, precision analytics—these currently ubiquitous buzzwords indicate that unsupervised, self-learning algorithms have been beneficial for an increasing number of researchers and biomedical problems. Their use will only increase as the data gets deeper and broader, as their predictive abilities are largely tied to the volume and disparity of data they are fed. This will have great impact on some of our most challenging problems, such as computational drug design and prediction, understanding disease course, and modelling organoids.
MG: Machine learning in statistical analysis is obviously very useful when dealing with large amounts of aggregated data. These techniques collectively allow one to find subtle patterns in data that are not immediately apparent by eye. They’re particularly useful with the high dimensional omics data and phenotypic data where many different types of data sets are being put together. This capability of machine learning is essential for developing precision medicine. It can not only analyse different treatment results but also keep track of all the information of a patient, like his/her weight, age, blood pressure, etc.
EE: Machine Learning has the potential to boost the field of diagnostics. Take for example diagnosing rare diseases in children. What is extremely challenging for humans can be done faster and more accurately with Machine Learning applications, like Face2Gene, as long as the diseases are clinically well defined.
AS: We are already beginning to see the impact of ML (Machine Learning) on precision medicine in limited use cases, e.g. radiology and ophthalmology image analysis. In our field, we use ML approaches to model response to drugs and have made discoveries in this manner. However, there is still a need for human expertise and we are a long way from full automation of this process.
FLG: From the infrastructure side of things, how do you go about future proofing as best you can, so you don’t end up having to reinvest in a new set up a few years down the road?
PW: This is an especially tough challenge, as the pace of innovation necessitates that technologies will develop more rapidly over time. We know the data deluge will only strengthen in intensity, and there will always be the need to scale data compute, storage, and exchange as fast as possible. I do think that these things tend to be in balance with other factors. For example, look at the Internet—there is always the desire and push to make it faster than it actually moves, and to make browsers better in order to handle, for example, high definition video. But we have moved quite quickly, and the rest of the scientific process also needs to evolve. We need to invent better ways to use and to ask relevant questions of big data. We also need to determine how to resource larger infrastructures and that tends to play a substantial role in modulating progress.
MG: One way of dealing with the rapid obsolescence of technology is to rent the infrastructure as opposed to buying it. This is one of the main impetuses behind cloud computing, where one is essentially renting cycles or purchasing them on the fly rather than buying computing equipment.
EE: This is a question of vision and habits. There is never enough time to do things right the first time and never enough money to do them right the second time. Unfortunately, it is very common to prefer tech-gurus (maybe because they are harder to understand) over the expert scientists only to hear a year down the road that an application is unfeasible because the system was not designed to support it. By then, of course, the tech-guru has already left for a better job and we are stuck with a dysfunctional system. To minimise costs in the climate of uncertain funding and rapid innovations, infrastructure should be as centralised as possible.
AS: You scale the parts of the infrastructure that are stable, keeping everything else nimble. Our algorithms need to remain nimble, but basic concepts such as genes and proteins remain largely constant. Also, it is important to keep an eye on the end customer of your infrastructure. For us, the end customer is internal and as such can tolerate changes more easily than an external one, so not overbuilding the system is another means of handling this. Hardware systems are another source of change and updates. In today’s world, this has been greatly simplified for consumers of hardware through the use of cloud technologies. Cloud vendors innovate rapidly and have made it easy for us to transition to new hardware and try new technologies such as GPUs without large investments.
FLG: There’s a lot that has been, and continues to be, said about the looming tsunami of data. At what point does the conversation shift from ‘how are we going to manage all of this data effectively?’ to ‘how are we using this data in our workflows?’
PW: I think this somewhat depends on perspective and the magnitude of the problem. We’ve had a tsunami of data for some time but have managed to use the data quite well in certain cases, such as in the ENCODE and TCGA projects. I think the difference now is that more people are affected, to the point that everyone in the space needs to have solutions. Hospitals, academic institutions, disease groups, educational programs, research funders—those organisations that realise that they are in the data business, and accordingly plan for enabling their constituents to use data effectively, will be best positioned for success.
MG: On the tsunami, one of the key issues in dealing with large-scale data is dealing with progressive summarisation of the data. For instance, for images one often does not deal with the raw files, but deals with compressed JPEGs and MPEGs. Likewise, for human sequences, instead of dealing directly with BAMs and reads, one might often deal with VCFs and eventually even just haplotypes. Similarly, for the sequence annotation, one would look at peak files and various annotated blocks, such as enhancers, rather than looking at raw signal tracks or the result of an RNAseq experiment. These summarisations are much lighter than the raw data and much easier to deal with. In this way, we do not need to worry about the data management problem but can focus on using the data. In addition, they often have much of the private information shorn.
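As a rough illustration of the progressive summarisation Mark describes, the sketch below collapses a per-position signal track into a short list of peak intervals. It is a toy under stated assumptions, not a real peak caller: the function name `summarise_peaks`, the fixed threshold, and the example values are all hypothetical and chosen only to show how a summary can be far smaller than the raw track.

```python
from typing import List, Tuple


def summarise_peaks(signal: List[float], threshold: float) -> List[Tuple[int, int, float]]:
    """Collapse a raw signal track into (start, end, max_value) intervals.

    Intervals are half-open [start, end) and cover each run of consecutive
    positions whose signal is at or above the threshold.
    """
    peaks = []
    start = None  # start position of the run currently being tracked, if any
    for pos, value in enumerate(signal):
        if value >= threshold:
            if start is None:
                start = pos  # a new run begins here
        elif start is not None:
            # the run ended at the previous position; record its summary
            peaks.append((start, pos, max(signal[start:pos])))
            start = None
    if start is not None:
        # the track ended while still inside a run
        peaks.append((start, len(signal), max(signal[start:])))
    return peaks


# Hypothetical raw track: ten positions reduced to two intervals.
raw_track = [0.1, 0.2, 3.5, 4.1, 3.9, 0.3, 0.0, 2.8, 5.0, 0.1]
peaks = summarise_peaks(raw_track, threshold=2.5)
print(peaks)
```

The summary keeps only where the signal is strong and how strong it gets, which is the kind of trade-off between size and detail that peak files make relative to raw signal tracks.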
EE: I am not sure such a shift would occur. The rate of data generation is a direct outcome of technological maturity and new technologies allowing the generation of new data types, like single cell or metagenomes. This will not stop just like the need to solve crimes and understand the molecular mechanisms of complex disorders will not disappear. Whether we are utilising the best possible resources to address these problems is a question for reviewers. Perhaps it is worth discussing how to best inform the community of the availability of these datasets and their maturity?
AS: It already has (or should have). A tsunami is not a wave of data with ebb and flow but a constant on-going increase in data. This is the new normal, and that’s a good thing. To create better models of biological systems, we need more data. Effective organizations have already determined how to use this data and incorporate it into their workflows. Those that have not need to partner with others, or else they risk getting left behind.
FLG: Ownership of data is something that gets brought up by the public frequently. Should patients and consumers own their own data, or does it sit best at an organisational level?
PW: Research should be an inclusive exercise, where all stakeholders contribute for greater awareness and better outcomes. The concept of learning health systems has not yet impacted greatly on genomic research, but the principles of coproduction and trust engineering apply quite well to genomics, especially clinical genome diagnostics and therapeutics. I believe that patients and the community have important strengths that researchers and clinicians can better leverage. Stronger community engagement, including in underserved areas, can bring new ideas, establish partnerships, and educate the public about the value of genomics. As this occurs, we should see positive evolution of our social understanding of genomics, and as a result, many of the ownership issues will begin to dissipate.
MG: The ownership of data is a complicated issue. Essentially it makes sense for the individuals to own their own data. However, researchers like me are getting many data sets from many individuals and constantly processing them and being asked to reuse data in various formats. It’s very complicated to do so over many individuals and one would have to imagine some sort of intermediate program mechanism to be built so that the data can be available for research but also keep the individual’s privacy intact.
EE: I believe that people have the rights to their own data and that only they can decide what to do with them.
AS: Data from a single patient has no (or very little) value. What has value is the amalgam of integrated data from many individuals, and the quality of questions being asked of the data. There needs to be more outreach to the public to help them understand that our only hope of tackling complex diseases such as cancer is for the data to be shared. This needs to be brokered by a trusted party, but accessible in some confidential manner to all. Whether the consumer or organisation owns the data is less important to companies like ours than whether there is a straightforward way for us to access it to enable discovery.
FLG: What needs to be considered when dealing with data policy issues, and can technologies such as block-chain help with this?
PW: Data policies will need to continue to evolve. As compliance and privacy footprints grow, the challenge for the research community will be how to best balance protection with utilisation for appropriate risk. To date, IT and compliance groups have focused more on protection, in large part because minimal or no risk solutions are more easily quantifiable and justifiable than those with “acceptable risk”, which is impossible to precisely define. At some point, this may make research intractable to conduct. It may make sense to better define what risks are acceptable, and to work with each other to come to consensus. While technologies like block-chain will help, ultimately it is human behaviour that poses the greatest risk and needs to adapt.
MG: Block-chain is a popular new technology for keeping a secure ledger. However, it is not well-suited to directly interact with large amounts of private data, such as the reads and BAMs from genome sequencing. Nevertheless, it might be useful for tracking modifications to these files, as far as looking at the metadata. One can imagine a metadata history of a file being dealt with in terms of a block chain.
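To make the idea of a metadata history concrete, here is a minimal sketch of a hash-chained log, in which each record stores the hash of the record before it. This is only the ledger-linking idea, not a distributed block-chain: there is no network, consensus, or signing. The helper names `add_record` and `verify`, the file name `sample.bam`, and the event labels are hypothetical. Only metadata about the file is recorded; the reads themselves never enter the chain.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def _digest(body: Dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def add_record(chain: List[Dict], metadata: Dict) -> Dict:
    """Append a metadata record that is linked to the previous record's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    body = {"index": len(chain), "prev_hash": prev_hash, "metadata": metadata}
    record = {**body, "hash": _digest(body)}
    chain.append(record)
    return record


def verify(chain: List[Dict]) -> bool:
    """Return True only if every record is unmodified and correctly linked."""
    prev_hash = GENESIS_HASH
    for i, record in enumerate(chain):
        body = {
            "index": record["index"],
            "prev_hash": record["prev_hash"],
            "metadata": record["metadata"],
        }
        if record["index"] != i or record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True


# Hypothetical history of one sequencing file.
history: List[Dict] = []
add_record(history, {"file": "sample.bam", "event": "created"})
add_record(history, {"file": "sample.bam", "event": "realigned"})
print(verify(history))
```

Because each hash covers the previous record's hash, editing any earlier entry makes verification fail from that point on, which is what makes such a log useful for tracking modifications.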
EE: There are many issues to consider, but I believe that the most important ones are that the policy should be short and understandable and explain the risks in donating data. Block-chain technologies can help in protecting and securing the data and I can envision that they will be adopted for sensitive data.
AS: Data policy questions centre around ownership of the data and data privacy. Data policies can help formalize the rules for sharing and technology can help implement those rules, but more fundamentally there needs to be an understanding at a societal level of the value and application of this data. There is a lot of hype around block-chain right now. It has the potential to help, but like all new technologies must be evaluated on its own merits.
FLG: As much as it is a technical challenge, data sharing is also a social challenge. How can you get people into the habit of wanting to share their data?
PW: In my experience, the best way to get people to share is to understand their particular uncertainties about sharing. These may vary: desire for attribution, perceived compliance risks, effort expenditure. Understanding these uncertainties can help to determine a behavioural economic strategy to overcome obstacles. Opportunities for direct engagement of data providers in the process, such as participating in developing standards, involvement in pilots utilising data that has been aggregated together, and acknowledgement of participation, can help. Finally, focus on those who do share, as the sharing will drive the research and capabilities forward; don’t dwell on those who do not or cannot, which can be an exercise of diminishing returns.
MG: Obviously, data sharing has issues related to data ownership, privacy and so forth, which are general societal questions and have to be dealt with. One of the key issues here is to keep the individual from being harmed by the sharing and this requires various protections. Obviously, there is health insurance protection for genomic data sharing but there might also be various social protections against stigma, which would be important for sharing a full description of an individual.
EE: One of the very first written texts was a self-promotion message written by King Hammurabi. This is no different than any animal in nature showing its strength, beauty, or skills. However, after that peacocking message came the first laws, which separated us from the animals. So while sharing may be ingrained in our DNA, perhaps more strict laws are the solution.
AS: That’s exactly right. I see it as more of a societal challenge than a technological one. Once we figure out the rules, the technology piece is (relatively) easy. Outreach is key. We must be able to explain to the public why this data is needed, its value and the potential for societal good. Understanding that any individual piece of data has low value by itself is key. We also need to put in place the right rewards for people to share their data. When people share social information on sites like Facebook, they do so because it provides them with a value they recognize. We need to determine the equivalent structures for medical data that will promote sharing and disclosure of their medical records.
FLG: What are you most excited about for the future?
PW: I think the day is coming when a researcher can go to their computer and, with the proper authorisation, be able to instantaneously query across all sizeable molecular, clinical, and environmental datasets generated with public funds. We are working towards that vision here at Cincinnati Children’s with a project called VIVA, which aims to provide a catalogue of all molecular, biomaterials, imaging, and clinical data across our academic health center. Our aim is to organize our data at the institutional level, rather than just by disease or cell type, and link this with similar data compilations from other institutions. I’m excited about the new possibilities that this data immediacy and democracy will bring, in terms of discovery potential, shortening the R&D cycle for improved health, and improving team science cohesion to tackle our most pressing and complex disease challenges.
MG: The most exciting part about the future is the application of advanced AI technology in the genomics field. It will help us to tackle all the questions above and solve the mysteries in genomics. Eventually it would help us find treatments for diseases that are currently incurable, automatically monitor the dynamic change of individual patients and adjust their therapy accordingly.
EE: I am terrified about the future. I do hope to see more women in science and more children interested in STEM.
AS: The ability to create and test predictive models that can be used to accelerate drug discovery and drug validation/approval. With more data, our capabilities will only improve to the point of reducing and one day potentially even removing the need for large clinical trials, with most results being computed in silico.