We take a look at the advantages of securing deeper access to data.

Pete White, Rieveschl Professor and Chair of the Department of Biomedical Informatics at the University of Cincinnati College of Medicine, and Division Director of Biomedical Informatics at Cincinnati Children’s Hospital.

The past five years have borne witness to a huge increase in the amount of data we are generating. We have a handful of impressive technological advances, and stepped-up support from funders, to thank for this. But with such excitement comes hype, which can be very difficult to manage, especially in the public domain. Pete White is at the centre of managing these complexities, so he talks to us about some of the ways we should be dealing with these challenges, and how a focus on education could be the key to success.

FLG: The term ‘big data’ is a buzzword that we see a lot, and if I’m being honest it’s caused me some confusion in the past. How would you define this term?

PW: I often see this term being used in two ways. One is to refer directly to computationally intensive problems, where large, and often complex, amounts of data are being collected and need to be analysed. This requires new methods and can challenge our computing infrastructures. The other, which I hear increasingly often, is as a wider reference to precision medicine in the modern era, where we can use ever broader and deeper access to data to develop new insights into some of our most complex challenges in biomedicine.

FLG: What was it that first inspired you to become a part of the data world, and led you to Cincinnati Children’s Hospital?

PW: I was initially attracted to “big data” through the Human Genome Project, which was one of the first efforts in biomedicine to really tap into the collective brainpower of a large number of scientists, and to perform research at scale. At the time, this was a somewhat controversial approach, but its success has launched a new era of team science, which is now a routine and essential way in which we approach genomic data challenges. The possibility of advancing more rapidly by sharing knowledge and flattening the landscape to allow broader participation of people and ideas was very exciting.

My move to Cincinnati Children’s three years ago was due to a similar attraction. It was apparent to me that the institution operated uniquely as a highly collaborative and cohesive community of scientists. In my time at Children’s, I have found that there are fewer barriers to success, and a collective consciousness fostered among our scientists and clinicians that is strongly supported by our Hospital leadership. Also, Children’s has invested substantially in infrastructure and its own basic science expertise, which is rare among paediatric institutions. This, together with a geography where the clinical and research space is co-located, as well as a strong history of mentorship, was a real attraction for my interests.

FLG: It is widely recognised that, thanks to today’s capabilities, we are generating data on a much larger scale than ever before. But, with so much new analysis relying on existing data, how important is it that we prioritise analysing the data sets we already have, rather than just churning out more?

PW: I consider the choice between legacy and prospective data to be a tough one. In my experience, legacy data is often fraught with challenges, such as variability in integrity or appropriateness for re-use. Often, issues of consent or intent—such as not having exactly captured the right data points—arise when considering legacy data for new projects. However, legacy data is often convenient from an access, utilisation, or cost standpoint. Conversely, prospective data collection allows one to precisely formulate a study based on immediate need, but it is often costly, both in terms of data capture and system development. In our projects, we’ve tried to assess what data is currently available and match it up as much as possible with the questions we are trying to answer; hopefully, at least some of the data we need is available somewhere.

FLG: As a society we are driven by short-term gains and the need to show instant impact. This makes it really difficult to demonstrate the wider impact of big data in healthcare. What challenges does this present to you when it comes to explaining the progress made in this sector?

PW: Just yesterday I was attending an advisory session held by a major funder. This group was deciding whether to renew a program that had demonstrated great success over a long period, and which had built a great system that could continue to flourish if it were maintained with little change in direction. However, the clear funder priority was to do something “new and different”, much to the chagrin of the scientists assembled. So where is the balance that keeps projects youthful and open to innovation, which can be disruptive and even threatening, while maintaining the stability and continuity that provides reassurance when productivity is high? I actually think that we probably do a pretty good job of this in academia, but we can clearly learn more from our industry partners on how to be nimble and prioritise. I agree that well-validated and well-implemented science can be maddeningly slow, but the decisions made are usually well reasoned. However, if we as researchers believe in the prudence of the scientific method, it is our duty to communicate the wisdom of this approach to others. As for me, it helps to compare where we have been (not much data, little understanding) with where we are now (lots of discovery and many examples of healthcare impact), rather than only indulging our tendency to compare the present to some ideal future (all data, all knowing). In my opinion, while the latter comparison is a great driving aspiration, it’s not such a great metric for measuring value.

FLG: It has been five years since the initial gold rush of big data. If you consider the real-world impact of big data analysis over this time, how effectively have we integrated it into healthcare?

PW: I think that this depends on your perspective. There are increasing anecdotes about big data analyses that have impacted healthcare, but these are far fewer than clinicians or patients would like. There is undoubtedly hype that is difficult to manage, and this again speaks to the demand for immediacy with which our society increasingly operates. However, ask a developmental biologist, and they might describe new methods in single-cell -omics as a revolution, as RNA-seq, methyl-seq, ATAC-seq, and the like allow incredible insights into normal and disease cell processes that we could hardly dream about years ago. Of course, it will take time for these to make their way into clinical practice. Big data, especially when it is generated in the context of discovery, is very complex, and thus takes time to discern, validate, and translate into clinical improvements.

FLG: A huge part of the analysis process is data sharing, in which funders play an instrumental role. How would you evaluate the part funders have played so far in incentivising data sharing?

PW: Funders are key to this process, and I think they have really stepped up the practice of incentivising data sharing in the last few years, both with rewards for those who share and penalties for those who don’t. We have seen mandates for including data sharing plans in grants for several years, but only more recently have these been enforced to any great extent. NIH’s recent endorsement of the FAIR (findability, accessibility, interoperability, reusability) data principles has really helped in this regard, as has patients’ interest in participating in, and benefiting from, the research in which they are engaged. We can really accelerate progress in research on complex disorders if we share as much as we can, as soon as we can.

FLG: Organisational barriers have continued to hold back big data adoption. Why are these hurdles so difficult to overcome, and how do you think they can best be addressed?

PW: I think that this is difficult because there are multiple invested stakeholders when data derived from patients is produced. There is value and contributed effort from the participant, the researcher, the organisation that hosts the research, the funder, and the taxpayer. Institutions have a need for return on their considerable investments in sustaining and growing a costly environment for cutting-edge research. Also, especially for clinical research, they bear most of the compliance and security risk, which can be quite significant. That risk can be estimated, but it is much more difficult to estimate the risk—or opportunity cost—of not conducting the research, so institutions have a natural tendency to be conservative. In this competitive funding environment for research and clinical care, institutions are also pressured to reduce costs and find new revenue streams, and data is one asset that at least in theory can be monetised. It can help to bring all of these stakeholders together to develop shared understanding and trust, so the long-term gains can be recognised. Also, we can make progress by being more innovative in our faculty reward structures, so that data sharing and team science are highly cherished—we’ve made progress in recognising this value, but it is still not prioritised or valued as highly as independent research in most academic settings.

FLG: The public can be pivotal in encouraging data sharing, but there is a noticeable disconnect between them and clinicians. In what ways can this relationship be improved, to add more value?

PW: Perhaps clinicians are better suited for participant and public engagement than most researchers, as their clinical training includes more focus on soft skills. There is a real need, and a growing awareness of it, for researchers and clinicians to treat engagement as a core competency. Patients and the public are invested in research outcomes, and often in the process, but our biomedical systems often treat them transactionally: please consent to our study, give us a biosample, and perhaps you will hear something at some point. That’s not terribly motivating. An increasing number of successful studies have incorporated principles of customer satisfaction into their patient-oriented research, so that these vital stakeholders feel as if they are part of something meaningful. There are lots of ways to do this: participant-oriented study design, open houses and outreach events, gamifying the research or clinical experience, crowd-sourcing analysis, returning results, and giving participants a choice in what is returned. Getting the public interested in, and literate about, what we as scientists do is the first step towards meaningful engagement, and it can lead to fierce advocacy for specific disease research. We have actually talked about incorporating education in such skills into our clinical research training programs.

FLG: You have always emphasised the value of literacy programmes when it comes to informatics. Why is it so important to push this kind of education to the forefront?

PW: I think that research effectiveness is directly correlated with literacy. Ideally, we want to maximise the contribution of diverse and creative ideas that come from many perspectives. To do so in a meaningful way requires subject literacy. This is a big challenge in informatics, where there is often a real or perceived translation gap between computational, bench science, clinical, and community stakeholders. Team science is a nice concept, but it is most effective when there are efficient ways to understand perspective and share knowledge, and that requires literacy. We have approached informatics (and genomics) literacy by considering three types of learners: formal learners, mainly those who are data scientists (or genomicists) for a living; engaged learners, who could benefit from practicing informatics (or genomics) as a component of their craft; and interested learners, who could benefit from knowing what informatics (or genomics) is, how it is practiced, and why it is useful. Each group has different literacy needs and formal or informal approaches to learning, and each can benefit from personalised academic advising that can develop learning pathways best suited to their situations. For example, pursuing a degree in our biomedical informatics PhD program might be right for some, but for others, just attending one of our interactive informatics focus groups, and possibly then our graduate certificate program over several years, might make more sense. Of course, this requires considerable effort and investment, but if we consider everyone an informaticist (after all, everyone analyses data in some way), providing the means to improve personal literacy strengthens capability, facilitates the creation of diverse teams, generates new ideas, and accelerates progress.

FLG: Cincinnati Children’s Hospital is a leader in paediatrics, and is a forward-thinker when it comes to big data analysis. Could you update us on what you have been getting up to over there?

PW: So many things! Here’s one project that I’m particularly excited about. Mayur Sarangdhar and a collaborative team in our group have taken advantage of the US FDA’s adverse event reporting data to build a system for mining drug toxicities. This system, called AERSMine, has harmonised over 10 million reports of clinical trial toxicities. Mayur has built an online system that allows a user to query across drug, drug indication, and adverse event, in order to quickly determine toxicities that are occurring more frequently than expected. This can be done for a specific compound or within a class of drugs. His system has been used very effectively for independently predicting toxicities for a number of drugs in active clinical trials. The promise is to better design trials in advance, so as to improve their chance of success. Mayur is now incorporating gene pathway data into his models in order to get a better sense of toxicity mechanisms.
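To give a flavour of what “occurring more frequently than expected” can mean computationally, here is a minimal Python sketch of a standard pharmacovigilance disproportionality measure, the proportional reporting ratio, applied to one drug and adverse event pair. This is illustrative only: it is not AERSMine’s actual method or code, and the report counts are invented.

```python
# Minimal sketch of disproportionality analysis on adverse event reports.
# NOT the AERSMine implementation; it only illustrates the general idea of
# flagging drug-event pairs reported more often than expected, using a
# proportional reporting ratio (PRR) on hypothetical counts.

import math

def proportional_reporting_ratio(a, b, c, d):
    """PRR for a drug-event pair from a 2x2 contingency table of reports.

    a: reports mentioning the drug AND the event
    b: reports mentioning the drug, without the event
    c: reports without the drug, with the event
    d: reports without the drug, without the event
    """
    rate_drug = a / (a + b)     # event rate among reports mentioning the drug
    rate_other = c / (c + d)    # event rate among all other reports
    prr = rate_drug / rate_other
    # Approximate 95% confidence interval on the log scale
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(prr) - 1.96 * se)
    upper = math.exp(math.log(prr) + 1.96 * se)
    return prr, lower, upper

# Hypothetical counts for a single drug-event pair
prr, ci_low, ci_high = proportional_reporting_ratio(a=120, b=4880, c=900, d=394100)
print(f"PRR = {prr:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
# A PRR well above 1 (for example, >2 with a lower CI bound above 1) is a
# common screening threshold for a potential safety signal.
```

In practice, a system like AERSMine has to run this kind of comparison across millions of harmonised reports and a huge number of drug, indication, and event combinations, which is where the big data engineering comes in.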

FLG: Flash forward another five years: where do you hope to see the realm of big data?

PW: I’d like to see the advent of “precision wellness”. We are beginning to incorporate additional data sets into our computational models beyond the traditional genotype-phenotype correlations, such as environmental influences, behavioural data, and social data. Can we do this effectively, and perhaps even extend this further? Health and wellness are impacted by virtually all realms of society: economics, politics, lifestyle choices, entertainment. Data from each of these realms can be used as feature inputs in predictive models, which can lead to unanticipated new correlations, as well as more sophisticated models for multifactorial disease.
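As a purely illustrative sketch of that idea, here is how features drawn from several realms (genomic, environmental, behavioural, and social) might be combined as inputs to a single predictive model. The feature groups and data below are synthetic and hypothetical; this is not a model anyone has built, only a sketch of the feature-combination concept.

```python
# Hypothetical sketch of "precision wellness" style modelling: combining
# features from several realms as inputs to one predictive model.
# All feature groups and outcomes are synthetic, for illustration only.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500  # number of individuals in this synthetic cohort

# Hypothetical feature groups (columns stand in for real measurements)
genomic = rng.normal(size=(n, 5))        # e.g. polygenic risk scores
environmental = rng.normal(size=(n, 3))  # e.g. air quality, exposures
behavioural = rng.normal(size=(n, 4))    # e.g. activity, sleep, diet
social = rng.normal(size=(n, 2))         # e.g. income, social support

# Stack all realms into one feature matrix
X = np.hstack([genomic, environmental, behavioural, social])

# Synthetic outcome that depends on a mix of realms, to make the point
# that no single data type carries the whole signal
logits = 0.8 * genomic[:, 0] + 0.6 * environmental[:, 1] + 0.5 * behavioural[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {scores.mean():.2f}")
```

The point of the sketch is only that heterogeneous realms can feed a single model; real precision-wellness models would need far richer data, careful handling of confounding, and thorough validation before any clinical use.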