Source: NHGRI

New research has identified nearly 5,000 previously unknown genes, 1,178 of which are believed to code for proteins. This work, which was posted as a preprint on bioRxiv in May, would raise the estimate of coding genes from roughly 20,000 to more than 21,000, renewing a long-standing debate about the size of the exome.

Scientists have been interested in quantifying coding genes for more than 50 years, with the earliest estimates appearing in 1964. Because of the limitations of the technology available at the time, these early estimates were largely based on guesswork and flawed reasoning, putting them in the region of 50,000 to 100,000 coding genes. After the Human Genome Project published its draft sequence in 2001, researchers had more data to work with and a series of new, much lower estimates followed, ranging from 26,000 to 40,000 coding genes. Since then, the estimated number of coding genes has continued to fall, with the most recent estimates being 20,500 (a 2007 genomics analysis) and 19,000 (a 2014 proteomics study).

Now, researchers primarily from Johns Hopkins University School of Medicine have established a new, higher estimate using data from 9,795 large-scale RNA sequencing experiments. Their analysis identified 21,306 protein-coding genes and 21,856 noncoding genes; of these, 1,178 coding and 3,819 noncoding genes had not previously been identified. The team’s coding total exceeds the most recent estimates from 2007 and 2014 and demonstrates how difficult it can be for researchers to define and identify genes.
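
As a quick check on how these figures fit together, the short Python sketch below tallies the counts reported above. The variable names are illustrative and do not come from the study; only the four input numbers are taken from the article.

```python
# Gene counts as reported in the article (variable names are illustrative).
coding_total = 21_306      # protein-coding genes identified
noncoding_total = 21_856   # noncoding genes identified
coding_new = 1_178         # coding genes not previously identified
noncoding_new = 3_819      # noncoding genes not previously identified

# Newly identified genes across both categories ("nearly 5,000").
new_total = coding_new + noncoding_new

# Genes in the team's catalogue overall.
all_genes = coding_total + noncoding_total

# Coding genes in the catalogue that were already known; this is the
# baseline behind the "roughly 20,000" figure.
coding_known = coding_total - coding_new

print(f"Newly identified genes: {new_total:,}")
print(f"All genes in catalogue: {all_genes:,}")
print(f"Previously known coding genes: {coding_known:,}")
```

The 4,997 new genes account for the "nearly 5,000" headline figure, and subtracting the new coding genes from the coding total leaves 20,128, in line with the earlier estimate of about 20,000.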

For example, the team, led by Steven Salzberg, PhD, identified a large number of RNA molecules that could be linked back to stretches of DNA that were not truly genes. During their analysis, the team needed to differentiate between these expressed regions and true genes in order to ensure that their estimate was as accurate as possible. To do so, they compared their human data to that of other related species, with the reasoning that DNA sequences conserved through evolution were likely to have a useful function and thus were likely to be genes.

Not everyone is convinced by this work, however. As Nature reports, Adam Frankish, a computational biologist at the European Bioinformatics Institute, and his team have scanned 100 of the coding genes identified in the study and believe that only one of them is a true coding gene. Researchers at RefSeq, a database from the US National Center for Biotechnology Information, have also voiced concerns that many of the results obtained by the Salzberg team would not be admissible to their database.

The Salzberg team acknowledge that the newly identified genes need to be validated by other teams before they can be considered accurate, but they stand by their data. With the vast complexities of the genome still defying our attempts to pin down information, and with the definition of what constitutes a gene still changing, the debate over the size of the exome is likely to continue for some time yet.

As Dr Salzberg said, “People have been working hard at this for 20 years, and we still don’t have the answer.”