Why Sifting Through the Junk is Important

Long before the human genome was sequenced, reassociation studies revealed that a considerable portion of our genome is composed of highly repetitive sequences. The fastest-annealing portion of the genome corresponded to the approximately 500,000 Alu sequences (SINEs), and in total the highly repetitive portion represents some 45% of the genome sequence. These sequences are distributed throughout the genome and present a real challenge to genome sequence assembly, especially because most DNA sequencing technologies produce short reads (less than 1 Kb for Sanger and considerably shorter for second-generation sequencing technologies).

The first draft of the human genome, which was based upon Sanger sequencing, thus covered only about 85% of the genome and consisted of a large number of unconnected sequences. It revealed that there were only around 20,000 total genes and that the coding portion of the human genome represented only about 1% of the total. Thus some 54% of the human genome is neither highly repetitive nor apparently protein-coding. These sequences were considered to be “junk DNA” of little importance. Once the first draft of the human genome was generated, it became possible to prepare tiling arrays consisting of oligonucleotides tiled over the non-repetitive portions of the genome. Hybridization of RNA to these tiling arrays revealed that most of the human genome was transcribed into RNA and that transcripts that apparently did not code for proteins vastly outnumbered coding transcripts. This then led to the collaborative ENCODE (Encyclopedia of DNA Elements) project, whose goal was to determine all the functional elements in the human genome. This project revealed that large numbers of sequences are transcribed into RNA but do not code for proteins.
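The 54% figure follows directly from the two percentages quoted above. The short sketch below reproduces that subtraction and, purely for scale, converts it to base pairs. The genome size used here is an assumed round number that does not come from this article, and the function name is illustrative only.

```python
# Percentages quoted in the text.
REPETITIVE_PCT = 45   # highly repetitive sequence
CODING_PCT = 1        # protein-coding sequence

# ASSUMPTION (not from the article): a round haploid genome size,
# used only to give a sense of scale in base pairs.
ASSUMED_GENOME_SIZE_BP = 3_100_000_000


def remaining_pct(*accounted_for):
    """Percentage of the genome left after subtracting the given parts."""
    return 100 - sum(accounted_for)


other_pct = remaining_pct(REPETITIVE_PCT, CODING_PCT)
other_bp = ASSUMED_GENOME_SIZE_BP * other_pct // 100

print(f"Neither repetitive nor coding: {other_pct}% (~{other_bp:,} bp)")
```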

We now realize that there are a variety of non-coding RNAs, which are called ncRNAs. NcRNAs are grouped into two major classes based on their transcript size: small ncRNAs, which are shorter than 200 nucleotides (nt), and long ncRNAs (lncRNAs), which are longer than 200 nt. Both classes can be further divided, and novel subclasses of ncRNAs continue to be discovered and characterized. NcRNAs have diverse functions within cells, from structural components to key regulatory molecules. Alterations and dysregulation of several ncRNAs have been reported in cancer. In this review, we will briefly go over the discovery of the different classes of ncRNAs and summarize some of the work demonstrating how a number of them play important roles in a variety of different cancers.
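The size-based grouping described above can be written as a one-line rule. This is a minimal sketch: the function name is hypothetical, and assigning a transcript of exactly 200 nt to the long class is an assumption, since the text only defines the classes on either side of that length.

```python
SIZE_CUTOFF_NT = 200  # threshold quoted in the text


def classify_ncrna(length_nt):
    """Classify a non-coding transcript purely by its length.

    ASSUMPTION: a transcript of exactly 200 nt is treated as long;
    the text only says "less than" and "longer than" 200.
    """
    if length_nt <= 0:
        raise ValueError("transcript length must be positive")
    return "small ncRNA" if length_nt < SIZE_CUTOFF_NT else "lncRNA"


# Approximate lengths taken from elsewhere in this article.
examples = {"miRNA": 22, "HOTAIR": 2200, "H19": 2300}
for name, length in examples.items():
    print(f"{name} ({length} nt): {classify_ncrna(length)}")
```

The cutoff is a convention rather than a functional test. A 300-nt snoRNA, for instance, would be labeled a lncRNA by this rule even though snoRNAs are treated as small ncRNAs later in this article.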

Tiling Arrays and the Transcriptionally Active Genome 

The company Affymetrix produced oligonucleotide microarrays by utilizing the trailing edge of the photolithography used to produce computer chips. When the number of features on a chip increased to 6.5 million, it became possible to make tiling arrays consisting of oligonucleotides tiled across the entire genome (with a sufficient number of chips), rather than focusing only on the exons of genes. Tiling arrays are classified by the distance from the middle of one oligonucleotide (nucleotide 13 in that oligonucleotide) to the middle of the immediately adjacent oligonucleotide. The smaller the distance between two tiled oligonucleotides, the more oligonucleotides it takes to cover large stretches of the genome. One of the pioneers in this field was Dr. Thomas Gingeras, who originally worked for Affymetrix, which gave him the luxury of obtaining large numbers of tiling array chips with which to probe transcriptional activity across the human and other genomes. Dr. Gingeras worked with 5 bp tiling arrays that had oligonucleotides tiled across ten of the smallest human chromosomes (the oligonucleotides were complementary to the non-repetitive portion of the human genome). When labeled RNA from human cells was hybridized to this array, his group found transcription across each of these chromosomes (not just in the coding genes), and the vast majority of transcripts detected did not correspond to any known human genes (1). Dr. Gingeras and his group then published articles using tiling arrays covering more of the human genome, all demonstrating that the vast majority of the human genome was transcriptionally active (2,3).
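The trade-off between tiling resolution and probe count described above can be illustrated with a rough calculation. The sketch below rests on stated assumptions: probes are taken to be 25-mers (inferred from nucleotide 13 being the midpoint), every chip feature is counted as one tiling probe, and the region is treated as fully non-repetitive. Real arrays also carry control features and skip repeats, so actual counts would differ. The function names are illustrative only.

```python
import math

PROBE_LENGTH_NT = 25           # ASSUMPTION: 25-mer, since nucleotide 13 is the midpoint
FEATURES_PER_CHIP = 6_500_000  # feature count quoted in the text


def probes_needed(region_length_bp, step_bp, probe_length=PROBE_LENGTH_NT):
    """Probes required to tile a region at a fixed center-to-center step."""
    if step_bp <= 0:
        raise ValueError("step must be positive")
    if region_length_bp < probe_length:
        return 0
    # The first probe starts at position 0; each later probe starts step_bp further on.
    return (region_length_bp - probe_length) // step_bp + 1


def chips_needed(n_probes, features_per_chip=FEATURES_PER_CHIP):
    """Chips required if every feature holds one tiling probe (an assumption)."""
    return math.ceil(n_probes / features_per_chip)


region_bp = 1_000_000  # a hypothetical 1 Mb non-repetitive stretch
for step in (5, 35):
    n = probes_needed(region_bp, step)
    print(f"{step} bp tiling: {n:,} probes, {chips_needed(n)} chip(s)")
```

Under these assumptions, moving from 35 bp to 5 bp tiling multiplies the number of probes roughly sevenfold for the same region, which is why high-resolution studies were first limited to a subset of chromosomes.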

The discovery that much of the genome was transcribed led to the Encyclopedia of DNA Elements (ENCODE) Consortium, an international collaboration of research groups funded by the National Human Genome Research Institute. This work began with a pilot project aimed at approximately 1% of the human genome and was then expanded into a full-scale whole-genome project. Various technologies were employed to characterize DNA elements that act at the protein and RNA levels, and regulatory elements that control cells and the activity of genes (4,5). These included measuring methylation across the genome, RNA sequencing (RNA-seq), CLIP-seq, ChIP-seq, and mapping of DNase hypersensitivity sites. In May and June of this year, 86 datasets (59 human and 27 mouse) were released and are now accessible for general use.

RNA-seq was vastly superior to tiling arrays for the characterization of non-coding transcripts, especially because this technology could detect low-abundance transcripts that could not be detected with tiling arrays. RNA-seq has revealed thousands of long transcripts, ranging in length from 200 nt to over 100 kilobases, that are now called long non-coding RNAs (lncRNAs, or lincRNAs for long intergenic ncRNAs) (6,7).

MicroRNAs (miRNAs) and Cancer

The best known and best understood of the non-coding RNAs are the miRNAs. The first was discovered in 1993 in C. elegans as the product of the gene lin-4, which affects development (8); it was found to be a small non-protein-coding RNA. MiRNAs are 19-24 nucleotide non-coding RNA molecules that regulate the expression of target mRNAs at both the transcriptional and translational levels. Each member of this large family of non-coding RNAs can have hundreds of different targets (through partial base-pairing in mammals), and almost 30% of mammalian genes are regulated by at least one miRNA (9,10). These non-coding RNAs are thus involved in multiple biological processes including proliferation, cell cycle regulation, and apoptosis. It is therefore not at all surprising that a number of these molecules play important roles in the development of cancer.

Profiling studies in a variety of different tumors have demonstrated that many miRNAs are differentially expressed in tumors versus normal human tissues. This then leads to a further differentiation of the miRNAs into two groups: oncomiRs (that function as oncogenes and as such are usually overexpressed in cancer as they promote tumor formation and spread) and tumor-suppressor miRs (which impair tumor growth and are silenced due to mutations, promoter methylation, or chromosomal rearrangements) (11-14). Many of these non-coding RNAs are located within the common fragile sites which themselves are highly unstable during cancer development and prone to either deletions or amplifications.

The small size of miRNAs provides a powerful advantage, especially in the context of utilizing them as stable blood-based molecular markers for cancer detection. It has been demonstrated that miRNAs are present in human plasma in a remarkably stable form that is protected from endogenous RNase activity (15). In addition, miRNAs can be used as expression markers to directly characterize solid tumors, as they are diagnostic and prognostic markers of lung cancer (16).

Other Small ncRNAs and Cancer 

Two other groups of small ncRNAs are the piRNAs and the snoRNAs. PIWI-family proteins and their associated small RNAs (piRNAs) function to protect the germline genome from the activity of transposable elements. Hundreds of thousands of piRNAs have been found in mammals (17). Deep sequencing has revealed that piRNAs are present in many more cell types than germline cells. Dysregulation of the expression of certain piRNAs is now being observed in a variety of cancers. For example, Cheng and colleagues demonstrated that piR-651 had higher expression in a variety of cancer tissues than in normal adjacent tissues (18). In another study, down-regulation of piR-823 was observed in gastric cancer tissues, suggesting that it could have a tumor suppressor role (19).

Small nucleolar RNAs (snoRNAs) are small non-coding RNAs that are between 60 and 300 nucleotides long. They are normally located within introns of protein-coding genes and are transcribed by RNA polymerase II (20). The snoRNAs accumulate in the nucleolar compartment, where they are responsible for 2’-O-ribose methylation and pseudouridylation of specific ribosomal RNA nucleotides. SnoRNAs have been shown to be involved in the onset of Prader-Willi syndrome, which is induced by genetic loss of the 15q11-13 locus. There are several copies of the HBII snoRNA at this locus, and their loss is correlated with the Prader-Willi syndrome phenotype (21). More recent studies have shown that a number of the snoRNAs are involved in cancer formation and progression.

Long Noncoding RNAs

RNA transcripts that are longer than 200 nucleotides but do not code
for any proteins are known as long non-coding RNAs (lncRNAs). These
transcripts are usually transcribed by RNA polymerase II and they
show epigenetic features that are commonly found in protein-coding
genes. The lncRNAs regulate several biological processes including
transcription (22), translation (23), cellular differentiation (24), cell cycle
regulation (25), chromatin modification (26), as well as nuclear-cytoplasmic trafficking (27). The number of these transcripts continues to grow and they now number in the thousands. Recent work demonstrated that up to 20% of 3300 known lncRNAs are bound by the polycomb repressive complex 2 (PRC2) (28).

My laboratory’s limited contribution to this field came from early
studies using whole genome 35 bp tiling arrays. We cultured normal
human bronchial epithelial cells and then subjected them to the
carcinogen NNK (which is the major carcinogen in cigarette smoke). We identified a number of lncRNAs that had greatly increased expression after this exposure. These transcripts were called long stress-induced non-coding transcripts (LSINCTs). One of these, LSINCT5, was induced by a variety of other cellular stresses, and we found that this transcript had increased expression in most breast and ovarian cancers tested (29). This transcript (and presumably several of the other LSINCTs identified by us) also had a growth-promoting effect upon cells (29).

The lncRNA H19 is 2.3 Kb in length and is encoded by the maternally imprinted gene H19 on chromosome 11p15.5. It was the first lncRNA determined to be associated with cancer (30). H19 expression is elevated in hepatocellular carcinoma and in bladder and breast cancers (31). The oncogene c-myc directly activates H19 by binding to the H19 promoter, and the tumor suppressor p53 decreases H19 expression (32).

Two lncRNAs that are encoded adjacent to each other on chromosome 11q are NEAT1 and NEAT2 (MALAT1). MALAT1 stands for metastasis associated lung adenocarcinoma transcript 1. This long transcript (7.5 Kb in length) functions in the regulation of gene expression and proliferation and has been shown to be up-regulated in a variety of different tumors (33). NEAT2 modulates the metastatic potential of tongue squamous cell carcinomas through the regulation of small proline-rich proteins (34) (Fang Z et al. BMC Cancer 2016). NEAT1 is the nuclear paraspeckle assembly transcript 1. Interestingly, LSINCT5 was also localized to the paraspeckles. NEAT1 is a transcriptional target of p53 and modulates p53-induced transactivation and tumor suppressor function (35). It is also a prognostic biomarker that regulates cancer progression via epithelial-mesenchymal transition in clear cell renal cell carcinoma.

A novel lncRNA was discovered in 2007 by John Rinn using 5 bp tiling arrays. It is a 2.2 Kb ncRNA residing within the HOXC locus, termed HOTAIR, which represses transcription in trans across 40 Kb of the HOXD locus (36). HOTAIR interacts with PRC2 (37). HOTAIR is overexpressed in a number of different cancers, and it regulates proliferation in breast cancer, melanoma, and other cancers. Elevated HOTAIR expression is also associated with cisplatin resistance in non-small cell lung cancer patients (38). Another lncRNA, identified with a tiling array that was focused on the promoters of 56 cell-cycle genes, was named PANDA. PANDA is induced in a p53-dependent manner, and it interacts with the transcription factor NF-YA to limit expression of pro-apoptotic genes (39). Depletion of PANDA sensitized human fibroblasts to apoptosis by doxorubicin (40).

A large number of lncRNAs function as enhancers and are thus termed enhancer-like lncRNAs (eRNAs). Then there is the large class of lncRNAs that are transcribed from the opposite DNA strand to other transcripts and are termed natural antisense transcripts (NATs). One well-known NAT is ANRIL, an antisense lncRNA that originates from the INK4B-ARF-INK4A locus. ANRIL is overexpressed in prostate cancer tissues. Repression of ANRIL reduces cellular proliferation and increases expression of p16INK4A and p15INK4B (41).

An lncRNA that also appears to have oncogenic functions is the
steroid receptor RNA activator (SRA). This non-coding transcript is
a co-activator for steroid receptors and is found in the nucleus and
cytoplasm. SRA regulates gene expression mediated by the steroid
receptors by forming complexes with proteins; these complexes also contain
the steroid receptor coactivator 1 (42). SRA levels were found to be
upregulated in various tumors of the human breast, uterus, and
ovary (43).

Only a small fraction of the lncRNAs present within mammalian cells
have been characterized. What is clear is that lncRNAs have diverse
functions within cells and also play key roles in gene regulation. Hence,
the dysregulation of the expression of a number of the lncRNAs could
clearly play an important role in the development of a variety of
different cancers.


For references, please see page 34 in Front Line Genomics magazine. 

 
