Breaking the DNA Code
by
Harold Richman
The Rosetta stone was found in Egypt in 1799 by a French officer in Napoleon's army, near a town called Rosetta at the mouth of the Nile. It is a dark slab of rock (long described as black basalt, though it is actually granodiorite) with an inscription written in three scripts: hieroglyphics, Demotic (the everyday script of Egypt at that time), and Greek. It was apparent that the three inscriptions recorded the same event. After many attempts by various Egyptologists, a French scholar, Jean François Champollion, was finally able to break the code twenty-three years after the stone was discovered. Champollion translated the Greek and, by comparing it to the hieroglyphics, was able to work out what the hieroglyphics meant. The Rosetta stone is on display at the British Museum in London. It is so well preserved that when I saw it for the first time I thought it was a reproduction rather than the real thing.
When Watson and Crick discovered the shape of the DNA molecule in 1953, they had the equivalent of the Rosetta stone without the Greek translation. They knew that the total genetic information of all living organisms was contained in the DNA molecule. Somehow the information was coded in the sequence of nucleotides whose paired bases form the rungs of the double helix. Many scientists did not think it possible to encode the required information using only four nucleotides. (These are usually referred to by their first letter: A = adenine, T = thymine, C = cytosine and G = guanine.)
Watson and Crick knew that the main products made in a cell are proteins. Enzymes are a special kind of protein and are sometimes referred to as biological catalysts. They enable various metabolic reactions to take place in the cell but are not themselves consumed in the reaction.
Another important clue was provided by Linus Pauling. He had shown that proteins consist of long chains of amino acids strung together like pearls on a necklace. There are many naturally occurring amino acids but only twenty of these are found in proteins.
Watson and Crick made the following assumption, which proved to be correct: if the amino acids were sequentially connected, the instructions contained in the DNA would also be sequential. After trying many different possible coding systems, they came to the conclusion that, in order to code for twenty amino acids, the nucleotides would have to be read in groups of three.
With an alphabet containing four letters (A, T, C and G), a word that is only one letter long could specify just four amino acids. A word two letters long could specify sixteen (4 x 4 = 16). This was not sufficient, so the word had to be three letters long, which provides sixty-four different words (4 x 4 x 4 = 64). The three-letter words are called codons, and the meaning of all sixty-four has now been identified. Three of them mark the end of a protein-coding sequence. Because sixty-four codons are more than twenty amino acids require, most amino acids are identified by more than one codon.
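The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch: the function names count_words and all_words are made up for this example, and the two small codon sets are written with DNA letters (T where RNA would use U). It counts the words of each length and lists the codons, then shows that the stop codons and the six codons for the amino acid leucine are all among the sixty-four.

```python
from itertools import product

ALPHABET = "ATCG"  # the four nucleotide letters named in the text


def count_words(length):
    """Number of distinct words of the given length over a four-letter alphabet."""
    return len(ALPHABET) ** length


def all_words(length):
    """List every word of the given length, e.g. all three-letter codons."""
    return ["".join(letters) for letters in product(ALPHABET, repeat=length)]


# Examples from the standard genetic code, in DNA letters.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # mark the end of a coding sequence
LEUCINE_CODONS = {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"}  # one amino acid, six codons

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "letter(s):", count_words(n), "possible words")

    codons = set(all_words(3))
    print("stop codons are valid codons:", STOP_CODONS <= codons)
    print("leucine codons are valid codons:", LEUCINE_CODONS <= codons)
```

Run as a script, it reports 4, 16 and 64 possible words for lengths one, two and three, which is why two-letter words fall short of twenty amino acids and three-letter words leave room to spare.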
Human DNA has about three billion letters, but only a small fraction is used to code for proteins. The rest is often referred to as junk, but in the future we may find some reason for the extra nucleotides. One suggestion is that much of the junk consists of the remnants of genes for proteins that evolved but were later replaced by more efficient versions. There does not seem to be a mechanism to remove the obsolete code, so it is just turned off.
A gene is a specific length of the DNA molecule containing a start and end marker. Early estimates put the number of genes in the human genome at about 100,000; the count of protein-coding genes is now thought to be closer to 20,000. Yet all the proteins specified by the genes can be assembled by using a code containing only four letters. An even more amazing fact is that this enormous amount of information is contained within a cell that is only one-fifth the size of the period at the end of this sentence.