We hear a lot about how similar the human genome is to the chimpanzee genome. As I have discussed previously, if we compare the genomes one way, they are 72% identical. If we compare them another way, they are more than 95% identical. If we compare them yet another way, they are 88-89% identical. That’s a wide range of results! Why can’t we say definitively how similar the human genome is to the chimpanzee genome? There are probably several reasons for this, but I want to highlight a basic one. Even though the human and chimpanzee genomes have been sequenced, we still don’t know them as well as you might think.
To understand why we don’t know these sequenced genomes very well, you need to know a bit about how DNA stores information. As most people know, DNA is a double helix. Each strand of this double helix has a sequence of chemical units called nucleotide bases. There are four different nucleotide bases: adenine (A), thymine (T), cytosine (C), and guanine (G). In the parts of DNA that code for proteins, these bases are read three at a time, and each group of three specifies a particular kind of chemical called an amino acid. The two strands of the double helix hold together because the nucleotide bases on one strand link up with the nucleotide bases on the other strand.
As shown in the illustration above, the way the nucleotide bases link up is very specific. Adenine (A) links only to thymine (T), and cytosine (C) links only to guanine (G). Because of this, if you know the sequence on one strand of DNA, you automatically know the sequence on the other strand. After all, A can only link to T, so anywhere one strand has an A, the other strand must have a T. In the same way, C can only link to G, so anywhere one strand has a C, the other strand must have a G. So the two strands of the DNA double helix are held together by pairs of nucleotide bases.
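To make the pairing rule concrete, here is a minimal Python sketch that works out the partner strand from a known strand. The function name partner_strand and the example sequence are made up for illustration; the sequence is not the one in the illustration.

```python
# Pairing rules described above: A links only to T, and C links only to G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def partner_strand(strand: str) -> str:
    """Return the base that pairs with each base in `strand`, position by position."""
    return "".join(PAIRS[base] for base in strand.upper())


example = "GATTACA"  # hypothetical sequence, chosen only for illustration
print(example)                  # GATTACA
print(partner_strand(example))  # CTAATGT
```

Applying the function twice gives back the original strand, which is another way of saying that either strand fully determines the other.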
As a result, we count the length of a genome in terms of how many base pairs there are. The illustration above, for example, has 14 base pairs (the black G is hiding a C behind it, and the black A is hiding a T behind it). Obviously, then, the larger the number of base pairs in the genome, the longer the genome is. Believe it or not, even though the human and chimpanzee genomes have been sequenced, we don’t know for sure how long either of them is!
When a genome is sequenced, scientists don’t start at the beginning and determine each base pair until they get to the end. We can’t analyze DNA that way. Instead, we take the DNA and chop it up into little chunks that are generally less than 1,000 base pairs long. When that happens, the order of these chunks is lost. As a result, a sequenced genome just consists of a lot of chunks. The scientists then try to piece those chunks together by looking for regions of overlap among the chunks. This is called “genome assembly,” and it is a terribly difficult thing to do.
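To give a feel for what "piecing chunks together by overlap" means, here is a toy Python sketch. It is a deliberately simple greedy approach written for illustration, not the method real assembly software uses. The function names, the minimum overlap of three bases, and the 14-base example sequence are all assumptions of mine.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest end of `left` that matches the start of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(chunks: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two chunks that share the largest overlap."""
    chunks = list(chunks)
    while len(chunks) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(chunks):
            for j, right in enumerate(chunks):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; the pieces stay separate
        merged = chunks[best_i] + chunks[best_j][best_size:]
        chunks = [c for k, c in enumerate(chunks) if k not in (best_i, best_j)]
        chunks.append(merged)
    return chunks


# Three shuffled chunks cut from a made-up 14-base sequence.
shuffled = ["CATGGA", "AGCTTGAC", "TGACCATG"]
print(greedy_assemble(shuffled))  # ['AGCTTGACCATGGA']
```

This tiny example works because every overlap is unique. When the same stretch of bases shows up in many chunks, more than one merge looks equally good, and that is a large part of why assembling a real genome is so hard.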
With all that information under our belts, let’s look at the human genome as reported by the Ensembl Project, a joint effort between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. There are two ways to count the number of base pairs in a genome. First, you can just count the number of base pairs that are found amongst the chunks that have been analyzed. According to Ensembl, there are 3,286,906,305 base pairs in the human genome. However, there is another way you can count the number of base pairs. You can look at the genome assembly that was done, and count the base pairs that are thought to be in that assembly. The Ensembl project calls this the “Golden Path Length,” and based on that method, there are 3,101,804,739 base pairs in the human genome. If we really knew the human genome, those two numbers would be the same. Thus, the difference between the numbers (about 6%) gives us an idea of how well we know the human genome.
Now let’s look at the same source to learn about the chimpanzee genome. There are 2,995,900,563 base pairs in the genome, but the Golden Path Length is 3,309,561,368 base pairs. Once again, if we knew the chimpanzee genome exactly, these two numbers would be the same. However, they are 9.5% different. That tells us that at least in one way, there is an error of about 9.5% when it comes to how well we know the chimpanzee genome.
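The percentages for both genomes can be checked with a few lines of Python. This sketch assumes the difference is expressed as a fraction of the Golden Path Length, since that choice reproduces both the 6% and the 9.5% figures; the function name percent_difference is just for illustration.

```python
def percent_difference(base_pairs: int, golden_path: int) -> float:
    """Gap between the two counts, as a percentage of the Golden Path Length.

    Dividing by the Golden Path Length is an assumption made for this sketch.
    """
    return abs(base_pairs - golden_path) / golden_path * 100


# Base-pair counts and Golden Path Lengths quoted from Ensembl above.
human = percent_difference(3_286_906_305, 3_101_804_739)
chimp = percent_difference(2_995_900_563, 3_309_561_368)

print(f"human: {human:.1f}%")  # human: 6.0%
print(f"chimp: {chimp:.1f}%")  # chimp: 9.5%
```

Dividing by the base-pair count instead would give roughly 5.6% for the human genome and 10.5% for the chimpanzee genome, so the exact figure depends on that choice, but the overall picture does not change.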
What does this say about how well we can compare the human and chimpanzee genomes? To me, it indicates that we can’t compare them very accurately. After all, we know the length of one genome only to within about 6%, and we know the length of the other only to within about 9.5%. At best, then, the error in our comparison will be around 9.5%, the error associated with the genome we understand the least.
In the end, then, while it is interesting to compare the genomes of two different species, we need to take the results of those comparisons with a grain of salt. When we can’t even tell you how long each genome is, it’s not clear how accurately we can determine how similar they are!
Would you expect the unknown lengths to be made up mostly of coding DNA or of what would conventionally be called “junk DNA”?
Josiah, the expectation is that they are mostly what would conventionally be called “junk DNA.” Remember, they need to sort through the chunks and see how much unique DNA is in each chunk. They do that by looking at overlapping patterns. In genes, the patterns are very easy to find. However, there are lots of sections of DNA where a single nucleotide base is repeated over and over again. While all such regions were at one time thought to be “junk DNA,” we now know that at least some of them have function. Personally, I doubt that very much of it is junk. However, because it’s composed of repetitive elements, it is hard to identify for certain using the DNA sequencing method that is currently employed.
I had never heard before that we don’t even know how long the genomes are. I guess I sort of assumed that a completely sequenced genome would have a known length. You learn something new every day reading this blog! 🙂
I am glad that’s the case, Vivielle!