Genome 3.0: ENCODE Takes Our DNA From Junk to Treasure

In the beginning, there were genes. From work mostly in bacteria, scientists believed that these pieces of DNA were the key to who we were, and that what made us more complex than a bacterium or a water bug was that we had more genes.

This wasn’t crazy or anything. Given what they knew, genes seemed like the obvious candidates for making us who we are.

Remember, genes are just the instructions for proteins and proteins are the molecules that do the heavy lifting in our cells. The idea was that more complex creatures would do more things and so would need more proteins and so have more genes.

Scientists figured we’d be a bit like the bacteria they were studying. The thinking was that since by and large bacterial genomes are mostly genes, ours should be too. But once we managed to develop the technology to sequence the human genome, we saw that we don’t resemble bacteria nearly as much as we thought.

Genome 2.0

At the beginning of the 21st century, scientists sequenced the human genome. To everyone’s surprise, there were only something like 20,000 to 25,000 genes. Most scientists expected 100,000 or more.

Water flea: She has more genes than you do.
Courtesy PLOS Genetics.

The genome appeared to be long stretches of not-genes interspersed with the occasional gene. Given our still gene-centric way of thinking, we called all the DNA that wasn’t a gene “junk” DNA. We figured it was the flotsam and jetsam of billions of years of evolution cluttering up our genomes.

As more and more organisms were sequenced, scientists kept getting the same result. Ten to thirty thousand genes spread across vast stretches of what looked like junk DNA. And to make matters worse, gene number didn’t appear to relate to complexity.

For example, a water flea has around 8,000 more genes than you do. A mouse, a roundworm and a flowering plant called Arabidopsis each have about the same number as you. At least a fruit fly only has 14,000 or so! Clearly there isn’t a lot of correlation between complexity and gene number.

The current theory for what is going on is that the key factor in making us complex is how we use the genes we have. A mouse uses its 23,000 genes one way, roundworms another and humans a third way. Or more accurately, each uses its genes in lots of different ways depending on cell type, DNA sequence, environment and so on.

The ENCODE project appears to support this idea and to show that a lot of that junk DNA is actually involved in controlling genes. One man’s junk is another man’s treasure.

Genome 3.0

The ENCODE project set out to figure out how genes are controlled. They focused on proteins called transcription factors (TFs) because scientists already had a pretty good handle on how these things work.

TBP — Transcription factors like this one are proteins that bind DNA and control how much a gene is turned on.

TFs stick to certain bits of DNA and control genes from there. So one might like the DNA sequence AAGCTT and another might like TATAAA. Each TF then sticks to any of its preferred sites that are accessible, and the level of gene expression is determined by which factors are bound where and for how long.
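To make the idea of "preferred sites" concrete, here is a minimal sketch that scans a stretch of DNA for the two example sequences above. The sequence, the names TF_A and TF_B, and the helper find_sites are all made up for illustration. Real transcription factors tolerate some variation in their sites and can bind either strand, which this exact-match toy ignores.

```python
def find_sites(dna, motif):
    """Return the 0-based start positions where motif occurs in dna (overlaps allowed)."""
    positions = []
    start = dna.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one base past the last hit so overlapping sites are found too.
        start = dna.find(motif, start + 1)
    return positions


# Made-up sequence for illustration only; not real genomic data.
dna = "GGAAGCTTCCTATAAAGGTATAAAC"

# Hypothetical TF names mapped to the example sequences from the text.
preferred = {"TF_A": "AAGCTT", "TF_B": "TATAAA"}

sites = {tf: find_sites(dna, motif) for tf, motif in preferred.items()}
print(sites)
```

In this toy sequence, TF_A has one candidate site and TF_B has two. Whether any of them would actually be bound in a cell depends on accessibility, which the sketch does not model.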

The ENCODE project found way more of these TF binding sites than its researchers expected. And while they matched up half a million to nearby genes, this still left millions without a gene. These sites are almost certainly controlling genes; we just don’t yet understand how.

One way they are probably controlling genes is through some sort of long distance control. Scientists already knew of many cases where DNA far from a gene can influence its expression. (And when I say far, I mean far. Our chromosome 1 is over 3 inches long on its own. That is massive in the microscopic world. If the nucleus where the DNA is stored were the size of a baseball, the DNA of chromosome 1 would stretch roughly half a mile. That is quite a bat!)
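For the curious, the "over 3 inches" figure can be checked on the back of an envelope. The sketch below assumes chromosome 1 has roughly 249 million base pairs and that each base pair adds about 0.34 nanometers of length (the usual textbook value for the DNA double helix). Neither number comes from this article, so treat them as assumptions.

```python
# Assumed inputs (not from the article): approximate textbook values.
BASE_PAIRS_CHR1 = 249_000_000   # rough length of human chromosome 1 in base pairs
NM_PER_BASE_PAIR = 0.34         # rise per base pair of the double helix, in nanometers
NM_PER_INCH = 25_400_000        # 1 inch = 2.54 cm = 25.4 million nanometers

# Stretched-out length of the DNA molecule, first in nanometers, then in inches.
length_nm = BASE_PAIRS_CHR1 * NM_PER_BASE_PAIR
length_inches = length_nm / NM_PER_INCH

print(f"{length_inches:.1f} inches")
```

Under those assumptions the stretched-out molecule comes to a little over 3 inches, in line with the figure above.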

One way this long distance control is thought to happen has to do with the fact that this long molecule isn’t stretched out. Instead it is packaged in an ordered clump.

So even though those TFs I talked about are arranged along these long molecules sort of like beads on a string, pieces that are far away on the DNA can be close in 3D space. Members of the ENCODE project identified and analyzed over 1000 of these DNA loops.

Another way to influence gene expression is by affecting the accessibility of the TF binding sites on the DNA. If a site is hidden, it can’t be bound by a TF and so can’t affect any genes.

Accessibility is controlled by that packaging (chromatin) I talked about earlier. Some DNA is in parts of the clump that the TFs can get to and some isn’t.

Another ENCODE group determined how the pattern of accessibility differs between cell types and firmed up the idea that TF binding can be like an avalanche. One TF binds, which opens the DNA up, allowing more to bind, which opens up more DNA, and so on.
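The avalanche idea can be sketched as a toy model. Everything here is an assumption made for illustration: the site positions, the reach of 200 base pairs, and the rule that a bound TF opens every site within that reach. It is not how the ENCODE groups analyzed their data, just a way to see how one open site can cascade.

```python
def spread_accessibility(positions, initially_open, reach):
    """Return the set of site positions that end up accessible.

    Toy rule (an assumption): a TF bound at an open site opens every
    other site within `reach` base pairs, and the process repeats
    until no new site opens.
    """
    open_sites = set(initially_open)
    frontier = list(initially_open)
    while frontier:
        current = frontier.pop()
        for pos in positions:
            if pos not in open_sites and abs(pos - current) <= reach:
                open_sites.add(pos)
                frontier.append(pos)  # the newly opened site can open its neighbors
    return open_sites


# Hypothetical binding-site positions along a stretch of DNA.
positions = [100, 250, 380, 900, 1000]
opened = spread_accessibility(positions, initially_open=[100], reach=200)
print(sorted(opened))
```

In this made-up example the site at 100 opens 250, which in turn opens 380, but the gap to 900 is too wide, so the last two sites stay hidden.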

Among its many contributions, the ENCODE data provides us with the start of a catalog of which parts of our DNA are important and, to a lesser extent, which parts interact with each other. This data is already starting to bear fruit in that it is providing us with clues to how certain genetic differences can affect a person’s chances of developing genetic diseases. In particular, certain findings with regard to Crohn’s disease are becoming more understandable.

The results I’ve described so far aren’t really a paradigm shift except in terms of the percentage of DNA being used. They are things any molecular biologist already knew, just spread out over more of our DNA and described in much more detail. This is not true of all of the results though.

Another big finding of the ENCODE project is that over 75% of our DNA is copied into RNA in one cell or another. This was unexpected and might one day push us to Genome 4.0.

You may remember that genes are copied into RNA before they are translated into proteins. Given that only 2% of our DNA is genes, most of this RNA is not being translated into proteins. That some RNA goes untranslated isn’t unexpected, given what we’ve learned over the past few years.

Scientists had been finding lots of different RNAs involved in, you guessed it, controlling gene expression levels. But no one expected there was this much untranslated RNA. There is so much of it that it is unlikely it is all contributing to controlling genes in the ways we’ve identified so far.

No, there are still lots of things to find out and when they figure out what this RNA is doing, we’ll learn more about our genomes. And I can’t wait to see what else they find as they’re figuring it out.