Data From 1 Million
The All of Us Research Program is a key element of the Precision Medicine Initiative (PMI), which launched back in 2016, when $130 million was allocated to the NIH to build a national, large-scale research participant group. Two years on, we take a look at what they have achieved, and how they plan on integrating their research into the U.S. healthcare system over the next ten years.
Dr. Joshua Denny is Professor of Biomedical Informatics and Medicine, Director of the Center for Precision Medicine and Vice President of Personalized Medicine at Vanderbilt University Medical Center. His work largely focuses on the use of electronic health record data for discovery and implementation of precision medicine. He is principal investigator of the All of Us Research Program Data and Research Center, which will enrol at least 1 million Americans in an effort to understand the genetic, environmental, and behavioural factors that influence human health and disease. He is also principal investigator of nodes in the Electronic Medical Records and Genomics (eMERGE) Network, Pharmacogenomics Research Network (PGRN), and the Implementing Genomics into Practice (IGNITE) Network.
FLG: For those who aren’t aware, could you please tell us the main mission of the ‘All of Us Research Program’?
JD: It is to create a very robust and engaged group of participants that, as a platform, will enable discovery for the prevention and treatment of disease and the reduction of health disparities, as well as more rational use of therapeutics and a better understanding of disease and of prevention strategies.
FLG: We are focusing on big data during this issue. From my understanding, the program has its own data and research center. What is your role in this, and could you talk me through the processes that happen within it?
JD: I am one of the principal investigators for the data and research centre, which we commonly call the DRC. Our job is to house all the data for ‘All of Us’ and make it useful for research. Basically, that means curating the data: pulling it together, cleaning it up and making it available to researchers with a variety of tools so that they can make good use of it. This is done whilst also upholding a high degree of trust with our participants in how we manage that data and make it available.

You can track a lot of this back to when we were first devising the program, after it was initially announced as a concept by President Obama. A working group was set up to write a document about how we would design ‘All of Us’. Given where we were going, we wanted to make the data as available as possible to researchers, enable the broadest use, and maintain greater security by centralising the data. This would also make it a lot easier and faster to do research studies, because a lot of the existing research studies out there, if you exclude things like the UK Biobank and focus on the United States, have actually used what is called a federated model. This is where a researcher poses a question and asks a whole bunch of different sites to run the analysis locally; it doesn’t always happen at the same time, and because the sites deploy differently, you can’t harmonise the data in the same way. This often takes a lot longer, and can hinder research and research quality. We thought centralising and building a strong data infrastructure, with the ability to apply state-of-the-art tools and enforce security policy, would enable faster research that would produce better outcomes: more confident results, whilst also maintaining a higher security standard.
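The difference between the federated and centralised models can be sketched in a few lines. This is a hypothetical illustration, not anything from the program: the sites, values, units and function names are all invented. The point is that when sites deploy differently (here, recording the same measurement in different units), combining per-site answers quietly mixes incompatible numbers, whereas pooling the raw data lets one curation rule be applied uniformly before analysis.

```python
# Hypothetical sketch contrasting the federated and centralised models
# described above. Sites, values, and units are invented for illustration.
import statistics

site_a = [5.1, 6.0, 5.5]        # cholesterol, recorded in mmol/L
site_b = [210.0, 198.0, 225.0]  # same measurement, recorded in mg/dL

def federated_mean(sites):
    # Federated model: each site analyses its own data and only the
    # summaries are combined. The units were never harmonised, so the
    # combined answer silently mixes mmol/L with mg/dL.
    return statistics.mean(statistics.mean(s) for s in sites)

def centralised_mean(sites, to_common_units):
    # Centralised model: pool the raw data first, applying one curation
    # rule uniformly, then run the analysis once over everything.
    pooled = [to_common_units(i, x) for i, s in enumerate(sites) for x in s]
    return statistics.mean(pooled)

def harmonise(site_index, value):
    # Convert site B from mg/dL to mmol/L (divide by 38.67) before pooling.
    return value / 38.67 if site_index == 1 else value

print(round(federated_mean([site_a, site_b]), 2))              # nonsense mix
print(round(centralised_mean([site_a, site_b], harmonise), 2)) # consistent
```

The federated number here is off by an order of magnitude, which is an exaggerated version of the harmonisation problem Denny describes; real federated networks do attempt harmonisation, but each site does it separately.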
FLG: How would you describe the program’s progression since it first launched, especially alongside the rapid advances within genomics and data generation as a whole?
JD: The grants for our program were funded in July 2016. We had a bit of a blueprint from the working group report that I was part of, but we had to create and build everything, including all the technology, as well as a protocol that could be approved by our board. We also had to establish the physical sites for enrolment and a Biobank that could collect samples shipped from across the United States; even the logistics were a tremendous amount of work. We were then able to enrol our first person on 31st May 2017. We have slowly been bringing on new sites, expanding the infrastructure and the computing environment; through this beta period we probably have close to 100 clinics enrolling individuals as we grow towards that national infrastructure. The Biobank has built up a lot more capacity to actually handle participant samples, and they are continuing to build along with everything else.

We on the data side have built all the initial pieces to take in the ingestion of data from different places and start to put it together. We are still actively building infrastructure to support the data curation, and building tools for researchers to get access to the data. That’s our primary focus right now, as we build a lot of that raw data capture. Our basic design is to build a system that captures all the raw data, handles all the identity management, and enables all the clinics across the country to input data in the same way, and then to build systems on top of that and curate it to make it available for researchers. That is the stage we are at now: working on that next stage to make the data we have useful.
FLG: You must rely heavily on technology providers to ensure you are able to generate data. How do you approach key players you’d like to work with, and what goals in particular have they allowed you to meet?
JD: The data and research centre is a partnership: Vanderbilt is the grant recipient, and then we have principal investigators at Verily, formerly Google Life Sciences, and the Broad Institute. The Broad Institute has a lot of experience with large-scale genomic analyses, high-throughput sequencing and genotyping. Verily is really experienced with the cloud, especially given its Google origins; they have a lot of experience with managing huge datasets, running algorithms at scale and making data quickly accessible, and we are leveraging a lot of those kinds of technologies. In fact, the Broad found some time ago that they couldn’t manage their data well on premises, and their compute environment has now moved to the cloud, on a similar Google platform. We are bringing that together with the phenotypic data and building tools for everyone to use, rather than just the particular set of users that the Broad serves. We harness a lot of different technologies, and a lot of what we are bringing in on our side, and with Verily, is machine learning and deep learning approaches. Even doing simple things with that much data can be hard, not to mention applying next-generation technologies.
FLG: With such a large pool of data, how do you prioritise your time to analyse existing data, against the need to keep generating more?
JD: We are still in our beta phase, and so we still have a small population of individuals that have come through, but we will have a lot more data coming through as the program enters its national launch this spring. In addition, we are just starting to get a lot of health record data, and we have plans now to get genomic data. That has been launched by NIH and will be carried out in established genome centres, probably by early 2019. We are currently building our system: we have created synthetic records based on real Vanderbilt data so that we can test and develop the system with the type of data we are going to have for the whole cohort. We are developing with 1 million Vanderbilt synthetic health records, and we will be doing similar things with the testing on the genomics side.

As we get data, a really important thing to do is to analyse it as it arrives, especially with electronic health record data, in order to look at its quality and try to find common errors. What we find is things like people having data recorded after death; a lot of these things happen in a real-world system, but we can create a series of rules to help us identify these kinds of problems. So, as we get data we are actively doing that curation in real time, as well as the analysis, on the survey data we are getting. We have just finished validating our first batch of that data to make sure we are getting what we think. This is going to be a process, not a destination, but it is something that we will keep on getting better at over time.
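The kind of rule-based curation check described here (flagging clinical events dated after a participant’s recorded death) can be sketched very simply. This is an illustrative toy, not the DRC’s actual pipeline: the record layout, field names and example data are all invented.

```python
# Minimal sketch of a rule-based curation check like the one described
# above: flag any clinical event dated after a participant's recorded
# death. Record structure and field names are invented for illustration.
from datetime import date

def flag_events_after_death(records):
    problems = []
    for rec in records:
        death = rec.get("death_date")
        if death is None:
            continue  # participant not recorded as deceased
        for event in rec["events"]:
            if event["date"] > death:
                problems.append((rec["participant_id"], event["code"]))
    return problems

records = [
    {"participant_id": "P001", "death_date": date(2017, 3, 1),
     "events": [{"date": date(2016, 5, 2), "code": "office_visit"},
                {"date": date(2017, 6, 9), "code": "lab_result"}]},
    {"participant_id": "P002", "death_date": None,
     "events": [{"date": date(2017, 1, 15), "code": "office_visit"}]},
]

# The lab result postdates the recorded death, so it gets flagged.
print(flag_events_after_death(records))
```

A real system would run many such rules over the pooled data as it arrives, which is exactly what centralisation makes practical.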
FLG: Ultimately, the goal of the program is to implement your datasets effectively into healthcare. How have you begun doing this, and what hurdles remain?
JD: The knowledge that we learn from this research project is the part that will be implemented into healthcare. I would say that we are still very much in the infancy of using genomic medicine in the real world. I think Genomics England is a real leader in showing its value, and in general cancer care is probably ahead of other fields. We haven’t really begun to understand how to use other kinds of technology that we will be gathering in this programme, like exposure monitors and activity monitors such as Fitbits and Apple Watches. We don’t really know how to use those in healthcare, and I think we will add to those discussions. I can tell you personally, at Vanderbilt, that we have been engaged in applying genomic medicine from the perspective of drug prescribing, and so we have ancillary systems that we have built that basically take genomic data and put it alongside the electronic health record, because EHRs aren’t designed for that kind of data. So, we put the actionable results that we definitely know matter into that person’s record, and we keep the rest, so that as we need it we can pull it out. That has been our model so far; that may change as the systems become more sophisticated, but that’s where we are today.
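The pattern described here (keep the full genomic data in a side store, surface only the actionable subset into the record, and retrieve the rest on demand) can be sketched as a small class. Everything below is an assumption for illustration: the class, its methods, the gene names, and the "actionable" list are invented, not Vanderbilt's system or any clinical rule set.

```python
# Hypothetical sketch of the pattern described above: store all genomic
# results beside the EHR, surface only the actionable subset into the
# record, and keep the rest retrievable on demand. The class, gene names,
# and the ACTIONABLE set are invented for illustration.

ACTIONABLE = {"CYP2C19", "TPMT", "SLCO1B1"}   # assumed pharmacogenes

class GenomicSidecar:
    def __init__(self):
        self._all_results = {}                # full data, kept aside

    def ingest(self, participant_id, results):
        # Store everything; the EHR itself never holds the raw bulk.
        self._all_results[participant_id] = results

    def actionable_for_ehr(self, participant_id):
        # Only results we "definitely know matter" flow into the record.
        results = self._all_results.get(participant_id, {})
        return {g: v for g, v in results.items() if g in ACTIONABLE}

    def lookup(self, participant_id, gene):
        # The rest stays available to pull out as it becomes needed.
        return self._all_results.get(participant_id, {}).get(gene)

store = GenomicSidecar()
store.ingest("P001", {"CYP2C19": "*2/*2", "BRCA2": "benign", "TPMT": "*1/*1"})
print(store.actionable_for_ehr("P001"))   # pharmacogenes only
print(store.lookup("P001", "BRCA2"))      # full data still retrievable
```

The design choice mirrors the interview: the EHR stays uncluttered, while the actionable set can grow over time without re-sequencing anyone.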
FLG: We have worked in the past with the Qatar Genome Project, as well as the 100,000 Genomes Project, to highlight the importance of generating a diverse representation of data. Why is this so crucial to the program’s success?
JD: One of the strengths of America is that we have a diverse population, and I’m sure you are well aware that in prior genomic studies, 82% of participants are of European ancestry, while, in a 2016 study, populations making up about a third of the US population accounted for only about 4% of existing GWAS participants. Some of the diversity we have in the US is really underrepresented among genotyped populations so far. I think we have an opportunity to really engage diverse populations, which will lead to new discovery. One of the things I point to is cholesterol medications, the PCSK9 inhibitors, which have a dramatic effect and were discovered through sequencing African-ancestry populations. Similarly, we have found that drug response varies quite dramatically among different ancestries, so I think there are real opportunities to learn from studying diverse populations. We will learn more about what matters, discover new potential drug targets we wouldn’t have discovered any other way, and I think we can help reduce and eliminate health disparities, which is really an important factor in our country. That is one of the reasons this is exciting; the same is true of the Qatar Biobank, because it adds to the population genomic information we know about to date. All these efforts that look at diverse populations will help.
FLG: The program is funded by the National Institutes of Health. How important are collaborations with funders like this in driving data sharing as a whole, and for you within the program specifically?
JD: The NIH is really important to us, obviously in the funding role, but beyond that too. The NIH has really helped set a national mission around this: that data should be open, that we centralise it, that we have a single institutional review board. All of those kinds of components would be hard to enforce in a model without a single funding entity showing leadership from the top. I think the role of the NIH has been very helpful in creating a better outcome than we might have had by other means.
FLG: In the next five years, how do you predict the program’s level of success in contributing to the wider efforts of precision medicine?
JD: Five years is an interesting timeline, because it is just at the edge of what we think we might be able to actually predict. I think that the earliest wins in this programme should involve understanding drug response. As we get genetic data, one of the things we are certainly going to learn in this programme is how people handle getting that kind of data back, as we are going to return data to participants and they’ll be interacting with their physicians. We will learn a lot about how to utilise genomic information in healthcare, and I think in the US that could start a conversation on a more national level about how you might use genetics in healthcare. That conversation probably won’t be complete, but it will be getting started in a robust way, in five years’ time. I think there will be disease genetic discoveries starting to come out, especially when you start to look at diverse populations, because in five years we will still be building our resource, but it will be big enough that I think some people will be able to do those research studies.

A lot of the next five years in this programme is about building a resource, and a lot of the wins will be technology or informatics wins: how do you get electronic health records to talk well together, how do you enable participants to receive genomic information and to share and move their electronic health records from one place to another, how do you curate some of that data and make it available to researchers in a way that we have confidence in, so that you can do studies across hundreds of thousands of participants. I think a lot of those kinds of things will be our underlying fabric. One thing I think will be happening in five years is engaging a diverse population of participants in a way that research studies really haven’t done before.

I think that will be a key element, and we will certainly be well on our way there in five years. That will also start a conversation that may have a big impact on how research involving human subjects is done in other studies as well. I think all those different conversations in that time frame might be just as important as the scientific discoveries at that point, because I think a lot of the big scientific discoveries will take a lot longer than five years.