You’ve heard it many times before. The vast majority of DNA is junk. Of course, the ENCODE project showed how wrong that notion is. Now that we know the vast majority of DNA is functional, you might wonder how in the world the idea of “junk DNA” became so popular among scientists. I suspect there are many reasons, but some recent research has revealed one of them – a bias regarding what it means for DNA to be functional. The research was done on molecules called long non-coding RNAs, which are commonly referred to as lncRNAs.
What are lncRNAs? Well, let’s start with what RNA is. The genes that your body uses are in your DNA, most of which is found in the control center of the cell, called the nucleus. In order for your cells to use those genes, they must be copied into another molecule. This process is called transcription, and the molecule that carries the copy is RNA. Once a gene has been transcribed, the RNA copy leaves the nucleus, at which point it is often referred to as messenger RNA (mRNA), because it is carrying a message to the rest of the cell.
What’s the message? It is a recipe for building a protein. That recipe is put together in informational units called codons, and it goes to a ribosome, which is a protein-making factory in the cell. The ribosome reads the codons, translating them one-by-one into a protein. Not surprisingly, this process is called translation. How does the ribosome know when to start building the protein? There is a start codon that tells it to start. How does it know when to stop building the protein? There is a stop codon that tells it to stop. As a result, you can think of messenger RNA in terms of the illustration above – it contains a start codon, a recipe for a protein (the blue bar in the illustration), and a stop codon.
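For readers who like to see a process written out, here is a minimal Python sketch of the reading process described above. It is a toy, not a model of a real ribosome: the function name `translate`, the made-up message, and the handful of codons in `CODON_TABLE` (a small subset of the standard genetic code) are all chosen just for illustration.

```python
# A few entries from the standard genetic code, enough for the toy message below.
CODON_TABLE = {
    "AUG": "M",  # the start codon, which also codes for methionine
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def translate(mrna):
    """Read codons from the first start codon until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing gets built
    protein = []
    # Step through the message three letters (one codon) at a time.
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break  # the stop codon ends the recipe
        # Codons missing from this tiny table are shown as "?".
        protein.append(CODON_TABLE.get(codon, "?"))
    return "".join(protein)


# Made-up message: a short leader, a start codon, three codons, a stop codon, a trailer.
print(translate("GGCAUGGCUAAAUGGUAAGCU"))  # MAKW
```

The letters before the start codon and after the stop codon are ignored, which is why only the part between them counts as the recipe.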
So how does this relate to lncRNAs? Well, messenger RNA is referred to as “coding RNA,” because it codes for the production of proteins. LncRNAs are called “non-coding RNAs,” because it was thought that they do not code for proteins. Now there are lots of RNAs that are thought to be non-coding, but lncRNAs are relatively long. That’s how they get their name. Well, it turns out for at least some lncRNAs, every part of their name (except RNA) is wrong.
In February of this year, Dr. Alexander Schier and his colleagues were looking at the development of zebrafish embryos. They analyzed the proteins that were present during the process, and they found several that had not been previously identified. One of them was a small protein that had exactly the sequence one would expect if it had been coded for by a lncRNA that has been called Toddler. In their study, they present six lines of evidence that this small protein is made by the cell using Toddler [1]. In other words, even though Toddler is known as a long non-coding RNA, it actually does code for a short protein, and during the development of the zebrafish embryo, the cell translates it into that protein. (As a point of terminology, a short protein is often called a peptide.)
How common is this? Well, a more recent study has identified hundreds of short proteins (peptides) in both zebrafish and humans that are produced from what were thought to be long non-coding RNAs. How did they identify them? They identified the signs of ribosomes reading the lncRNAs. In other words, it’s not just that there are proteins which have the sequences you would expect from lncRNAs. The research team actually showed that those lncRNAs went to a ribosome and got translated! [2]
So…if there really are hundreds of lncRNAs that actually do code for proteins, why did geneticists call them non-coding? When I first started reading about them, I assumed that geneticists thought they were non-coding because they didn’t have the structure of messenger RNA, as shown in the illustration above. Maybe they didn’t have a start codon. Maybe they didn’t have a stop codon. I just assumed that something must be missing from them. Why else would they be called non-coding?
However, with help from a professor who has forgotten more molecular genetics than I will ever learn, I found out that lncRNAs have exactly the structure you find in messenger RNA. They have a start codon, a recipe for a short protein, and a stop codon. Why, then, were they called non-coding? Because the recipe is really short. Geneticists simply assumed that recipes for short proteins don’t get translated, so RNAs that contain short recipes were thought to be non-coding. Now, of course, there’s more to it than that. Geneticists also tend to compare the RNAs they find in cells to known proteins, and there aren’t that many known short proteins in living organisms.
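To see how a length assumption alone can produce the “non-coding” label, here is a rough Python sketch. Everything in it is hypothetical: the function names, the made-up sequences, and especially the cutoff of 100 codons, which is an assumed value for illustration only (the research discussed here is not being quoted for any specific number). It is not the procedure any particular lab uses.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def recipe_length_in_codons(rna):
    """Count codons from the first start codon up to (not including) the stop codon.

    Returns 0 if there is no start codon or no stop codon in the same reading frame.
    """
    start = rna.find("AUG")
    if start == -1:
        return 0
    for i in range(start, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            return (i - start) // 3
    return 0


def label_rna(rna, min_codons=100):
    """Label an RNA using nothing but the length of its recipe.

    min_codons=100 is an assumed cutoff, chosen only for this illustration.
    """
    n = recipe_length_in_codons(rna)
    if n == 0:
        return "no complete recipe"
    return "coding" if n >= min_codons else "non-coding (recipe too short)"


# Two made-up RNAs with the same structure: start codon, recipe, stop codon.
short_rna = "AUG" + "GCU" * 20 + "UAA"   # 21 codons before the stop
long_rna = "AUG" + "GCU" * 150 + "UAA"   # 151 codons before the stop

print(label_rna(short_rna))  # non-coding (recipe too short)
print(label_rna(long_rna))   # coding
```

Both made-up RNAs have a start codon, a recipe, and a stop codon. The only thing that separates them in this sketch is the length cutoff, which is the kind of bias described above.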
So in the end, we now know that geneticists’ bias towards large proteins led to a big mistake. At least hundreds of long non-coding RNAs are, in fact, short coding RNAs! It will be interesting to see how this line of research progresses. If you look at all the RNA that is produced in animals and people, you find a bewildering number that have start codons, stop codons, and a recipe for a short protein. I suspect that as time goes on, we will find a bewildering number of short proteins that are produced by those RNAs, at least at specific stages in the life of most organisms.
1. Pauli A, Norris ML, Valen E, Chew GL, Gagnon JA, Zimmerman S, Mitchell A, Ma J, Dubrulle J, Reyon D, Tsai SQ, Joung JK, Saghatelian A, and Schier AF, “Toddler: an embryonic signal that promotes cell movement via Apelin receptors,” Science 2014, doi:10.1126/science.1248636
2. Bazzini AA, Johnstone TG, Christiano R, Mackowiak SD, Obermayer B, Fleming ES, Vejnar CE, Lee MT, Rajewsky N, Walther TC, and Giraldez AJ, “Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation,” EMBO Journal, 33(9):981-93, 2014