Imagine a rogue government spy trying to break into the president’s computer to access classified information. He needs a two-letter password to get in. His strategy is to keep trying different combinations until he stumbles on the right one by chance, and he has only an hour before the guards make their hourly inspection of the office. Will he be able to crack it in time? The spy has worked it out and is confident he will, and here is why:

  • There are 26 letters he can use in each position of a two-letter password, so the total number of possible combinations is 26² = 676.
  • Suppose each combination takes 5 seconds to key in and check whether it is correct.

In the worst case he will have to try all 676 combinations before hitting the right one. The time he needs is at most 676 × 5 = 3,380 seconds (about 56 minutes), which is less than an hour. Therefore he will crack it just before the guards come.

Suppose the president had been a bit smarter and used a four-letter password. Would the spy make it now? The number of possible combinations is 26⁴ = 456,976, so the time he needs is up to 456,976 × 5 = 2,284,880 seconds (about 635 hours, or more than 26 days). Clearly he will not be able to search the entire space of possible passwords within the hour.
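The spy's arithmetic can be checked with a short calculation. The sketch below is illustrative only: the function name `worst_case_seconds` is an assumption introduced here, and it simply reuses the 26-letter alphabet and the 5 seconds per attempt from the example.

```python
ALPHABET_SIZE = 26        # letters available for each position
SECONDS_PER_GUESS = 5     # time to key in and check one guess (from the example)


def worst_case_seconds(password_length: int) -> int:
    """Time needed to try every possible password of the given length."""
    combinations = ALPHABET_SIZE ** password_length
    return combinations * SECONDS_PER_GUESS


for length in (2, 4):
    seconds = worst_case_seconds(length)
    print(f"{length} letters: {ALPHABET_SIZE ** length:,} combinations, "
          f"{seconds:,} s = {seconds / 3600:.1f} h")
```

Adding just two letters multiplies the search time by 26² = 676, which is why the search space, not the speed of typing, decides the outcome.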

The blind watchmaker must search blindly for parts

The blind watchmaker, also known as natural selection acting on a population, requires variation to work. The majority of that variation comes from genetic mutations. Mutations are random in the sense that they are unguided and blind: they are errors that occur during DNA replication. DNA consists of a sequence of nucleotides, with each group of three nucleotides representing one amino acid. Proteins consist of sequences of amino acids; amino acids are to proteins what letters are to words. Each protein has a function determined by its particular sequence of amino acids, and amino acids can link up to form chains of any length.

Protein structure is described at several levels, including primary, secondary and tertiary. The tertiary structure is the three-dimensional arrangement of the amino acid chain, called a fold, and it is what gives a protein its biological function. Proteins perform various functions, such as catalyzing specific reactions and cell signaling, and they form parts of molecular complexes such as the spliceosome, ribosome and polymerase.


Figure 1: Central dogma of biology which illustrates the protein synthesis process. DNA contains genes which are sequences of nucleotides which code for a specific protein. Mutations in DNA lead to changes in proteins which then presumably ultimately lead to changes in the traits of an organism. Source: thinklink
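The mapping from nucleotide triplets to amino acids can be sketched in a few lines. This is a toy illustration, not a full translation tool: the two DNA strings are hypothetical, the function name `translate` is introduced here, and the table holds only the five codons of the standard genetic code that the example needs.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "ATG": "Met",
    "CCT": "Pro",
    "GAG": "Glu",
    "GTG": "Val",
    "AAG": "Lys",
}


def translate(dna: str) -> list:
    """Translate a coding-strand DNA string into amino acids, three bases at a time."""
    usable = len(dna) - len(dna) % 3          # ignore an incomplete trailing codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]


original = "ATGCCTGAGAAG"   # hypothetical sequence
mutant = "ATGCCTGTGAAG"     # single A -> T substitution in the third codon

print(translate(original))  # ['Met', 'Pro', 'Glu', 'Lys']
print(translate(mutant))    # ['Met', 'Pro', 'Val', 'Lys']
```

A single changed nucleotide swaps one amino acid in the chain, which is the kind of point mutation the rest of this article is concerned with.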

Protein structure

The three-dimensional structure of a protein gives it its unique function. For example, replacing a single amino acid in the hemoglobin protein changes how the molecule behaves: the mutant hemoglobin clumps into rigid fibers when it releases oxygen, and this causes sickle cell disease. Not all protein sequences are able to form stable folds, just as not all combinations of letters produce meaningful words.

Figure 2: Primary, secondary and tertiary structure of proteins. Tertiary structure (protein fold) determines the biological function.

New organisms require new physiological systems such as respiration, which need new organs, which in turn means new tissues, new cell types and new proteins. The origin of new species must at least begin with the origin of new functional proteins.

Protein Sequence space

Suppose we take a protein consisting of a chain of 150 amino acids. How many different possible sequences are there? Proteins are built from 20 different types of amino acids, which means the number of possible sequences is 20¹⁵⁰ ≈ 1.4 × 10¹⁹⁵. To put this number into perspective, compare it with the total number of individual organisms that have ever lived, which is estimated to be 10⁴³. This value sets an upper limit on the resources available to random mutations: if a search cannot succeed within this limit, it could not have occurred during the history of life. Clearly 10¹⁹⁵ is an astronomical number, but fortunately random mutations do not have to search the entire sequence space to find a functional protein.
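The size of this sequence space is easy to verify. The snippet below is a minimal check, using the 20 amino acid types, the chain length of 150 and the 10⁴³ organism estimate from the paragraph above; the variable names are introduced here for illustration.

```python
import math

AMINO_ACID_TYPES = 20
CHAIN_LENGTH = 150
ORGANISMS_EVER_LIVED_LOG10 = 43   # the text's estimate of 10^43 organisms

# Python integers are arbitrary precision, so 20**150 is computed exactly.
total_sequences = AMINO_ACID_TYPES ** CHAIN_LENGTH

# Work in log10 to read off the order of magnitude.
log10_total = CHAIN_LENGTH * math.log10(AMINO_ACID_TYPES)
exponent = math.floor(log10_total)
mantissa = 10 ** (log10_total - exponent)

print(f"20^150 ≈ {mantissa:.1f} × 10^{exponent}")
print(f"digits in 20^150: {len(str(total_sequences))}")
print(f"orders of magnitude above 10^43: "
      f"{log10_total - ORGANISMS_EVER_LIVED_LOG10:.0f}")
```

Even if every organism that ever lived had tried a different sequence, the fraction of the space sampled would be about 1 part in 10¹⁵², which is why the ratio of functional to non-functional sequences matters so much.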


Figure 3: Fitness landscape in sequence space.

These figures illustrate the fitness landscape concept and how random mutations must search through sequence space to find functional (fit) proteins. Each point on the landscape represents a different protein sequence. The mountain peaks represent the most functional sequences, and the low, flat regions represent non-functional sequences. The crucial question is the ratio of peak sequences to flat ones, because random mutations must “blindly search” the sequence space for functional peaks.



Figure 4: Types of fitness landscapes

(B) A rugged landscape, where proteins can get trapped on smaller peaks before reaching the highest peak. (C) An ideal fitness landscape, where each point mutation leads to a fitter protein. If the landscape is rugged, then incremental, one-step-at-a-time mutations will not find the highest peak but will get stuck on low local peaks, separated from higher ones by low-fitness valleys. If the landscape is a smooth upward hill, then one-step-at-a-time mutations can slowly find the peak of the landscape.
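The difference between the two landscapes can be seen in a toy adaptive walk. This is a deliberately simplified sketch: real sequence space has many dimensions, whereas here the "sequences" sit along a single line, and the fitness values and the function name `adaptive_walk` are invented for illustration.

```python
def adaptive_walk(landscape, start):
    """Greedy walk: step to the fitter adjacent position until no neighbor is fitter."""
    position = start
    while True:
        neighbors = [p for p in (position - 1, position + 1)
                     if 0 <= p < len(landscape)]
        best = max(neighbors, key=lambda p: landscape[p])
        if landscape[best] <= landscape[position]:
            return position          # local peak: every single step is neutral or worse
        position = best


# Hypothetical fitness values, one per "sequence" along a line.
smooth = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
rugged = [0.0, 0.2, 0.4, 0.1, 0.3, 0.2, 1.0]

for name, landscape in (("smooth", smooth), ("rugged", rugged)):
    end = adaptive_walk(landscape, start=0)
    print(f"{name}: stops at position {end} with fitness {landscape[end]}")
```

On the smooth landscape the walk reaches the global peak (fitness 1.0). On the rugged one it stops at the first local peak (fitness 0.4), because every single step away from it lowers fitness.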

So the two crucial questions are then:

  1. What kind of landscape do proteins have? Is it rugged or smooth?
  2. What is the ratio of functional proteins to non-functional proteins in the protein landscape?

Experimental results for the rarity of functional proteins

1. Estimating the Prevalence of Protein Sequences Adopting Functional Enzyme Folds

Douglas Axe conducted an experiment to estimate the ratio of functional, folding protein sequences to the total sequence space.

“Starting with a weakly functional sequence carrying this signature, clusters of ten side-chains within the fold are replaced randomly, within the boundaries of the signature, and tested for function. The prevalence of low-level function in four such experiments indicates that roughly one in 10⁶⁴ signature-consistent sequences forms a working domain. Combined with the estimated prevalence of plausible hydropathic patterns (for any fold) and of relevant folds for particular functions, this implies the overall prevalence of sequences performing a specific function by any domain-sized fold may be as low as 1 in 10⁷⁷, adding to the body of evidence that  functional folds require highly extraordinary sequences”[i]

His experiment indicated that, for a 148-amino-acid portion of a protein, only about one in 10⁶⁴ signature-consistent sequences forms a working fold. Remember that not every sequence that folds will perform a given function. Taking further factors into account, he estimated that the overall prevalence of sequences performing a specific function may be as low as one in 10⁷⁷. Axe also cites an earlier experiment on a 92-amino-acid protein chain: he notes that Reidhaar-Olson and Sauer estimated the proportion of 92-residue sequences that form a functional λ-repressor fold to be about 1 in 10⁶³.[ii]

2. Functional proteins from a random-sequence library

Scientists investigated how rare or common ATP-binding proteins are among all possible sequences of an 80-amino-acid protein. They conclude: “In conclusion, we suggest that functional proteins are sufficiently common in protein sequence space (roughly 1 in 10¹¹) that they may be discovered by entirely stochastic means, such as presumably operated when proteins were first used by living organisms. We therefore estimate that roughly 1 in 10¹¹ of all random-sequence proteins have ATP binding activity comparable to the proteins isolated in this study.”[iii]

3. Searching sequence space for protein catalysts[iv]

The authors evaluate how many protein sequences must be searched to find a functional AroQ-class chorismate mutase (CM). Mutases are enzymes of the isomerase class: they catalyze reactions in which a functional group is shifted from one position to another within the same molecule. They say, “Genetic selection was used to explore the probability of finding enzymes in protein sequence space…This study provides a quantitative assessment of the number of sequences compatible with a given fold and implicates previously unidentified residues needed to form a functional active site”. The study then concludes, “Extrapolating from our data and from modest sequence constraints on interhelical turns (23, 28–30), we can estimate that if every position in the protein had been randomized, a library of 10²⁴ members would have been needed to obtain AroQ mutases.” In other words, by their estimate only about one in every 10²⁴ fully randomized sequences would be a functional AroQ mutase.
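To compare the three prevalence figures quoted so far, it helps to ask how many random sequences would have to be sampled for a good chance of finding at least one functional protein. The sketch below assumes independent, uniformly random draws, which is a simplification of any real experiment or evolutionary process. The function name `library_size_for` and the 95% target are assumptions introduced here, and the 10²⁴ figure is treated as a prevalence in line with the reading given above.

```python
import math


def library_size_for(prevalence: float, chance: float = 0.95) -> float:
    """Random sequences needed for the given chance of at least one functional hit.

    Solves 1 - (1 - prevalence)**n = chance for n, assuming independent draws.
    """
    # log1p keeps precision when prevalence is far below float epsilon.
    return math.log(1.0 - chance) / math.log1p(-prevalence)


# Prevalence estimates as quoted in the three studies above.
estimates = {
    "ATP binding (Keefe and Szostak)": 1e-11,
    "AroQ chorismate mutase (Taylor et al.)": 1e-24,
    "functional enzyme fold (Axe)": 1e-77,
}

for name, prevalence in estimates.items():
    size = library_size_for(prevalence)
    print(f"{name}: about {size:.1e} random sequences for a 95% chance")
```

As a rule of thumb, the required library is about three times the reciprocal of the prevalence, so the three estimates imply searches that differ by more than sixty orders of magnitude.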

4. Experimental Rugged Fitness Landscape in Protein Sequence Space

Researchers conducted an experiment to map the fitness landscape of a particular protein. They randomized a section of the protein (139 amino acids), specifically the D2 domain, which is crucial for phage infection. They then estimated how large a library (number of different sequences) would be required to find a protein with a fitness equal to that of the wild-type (naturally occurring) protein.

“In practice, the maximum library size that can be prepared is about 10¹³ [28,29]. Even with a huge library size, adaptive walking could increase the fitness, ~W , up to only 0.55. The question remains regarding how large a population is required to reach the fitness of the wild-type phage. The relative fitness of the wild-type phage, or rather the native D2 domain, is almost equivalent to the global peak of the fitness landscape. By extrapolation, we estimated that adaptive walking requires a library size of 10⁷⁰ with 35 substitutions to reach comparable fitness. Such a huge search is impractical and implies that evolution of the wildtype phage must have involved not only random substitutions but also other mechanisms, such as homologous recombination.”[v]


Figure 5: Experimental results of protein fitness landscape for D2 domain required for phage infection

“First, the smooth surface of the mountainous structure from the foot to at least a relative fitness of 0.4 means that it is possible for most random or primordial sequences to evolve with relative ease up to the middle region of the fitness landscape by adaptive walking with only single substitutions.”

Their results indicate that adaptive walking, or step-by-step mutation, can climb the fitness landscape up to a relative fitness of about 0.4, where the wild-type fitness (the activity of the protein found in organisms in the wild) is defined as 1. Up to that level the landscape is smooth and single point substitutions can climb the hill. Above a fitness of 0.4, however, the landscape becomes rugged and fit sequences become considerably rarer. They estimate that reaching a fitness comparable to the wild type would take a library of 10⁷⁰ different sequences and 35 substitutions.

In other words, there is roughly one protein with a peak fitness of 1 per 10⁷⁰ proteins. The researchers conclude that such a huge search is impractical, so the wild-type phage could not have been found by random substitutions alone, and they suggest that other mechanisms, such as homologous recombination, must also have been involved.


Figure 6: Experimental and theoretical (cytochrome c) results of protein fitness landscapes of different proteins. Values represent the total number of non-functional proteins per functional protein. So for the beta-lactamase domain – 10⁷⁷ proteins must be searched through in order to find one functional protein.

Experimental results show that:

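  • The fitness landscape of proteins can be rugged, which means that Darwinian point mutations alone are not sufficient for finding the highest peaks.
  • Protein sequence space is populated overwhelmingly by non-functional sequences; the estimates above range from 1 functional sequence in 10¹¹ to 1 in 10⁷⁷, depending on the protein and the function tested.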

In part 2 I will continue by looking at more in-depth evidence for the rarity of folding and functional proteins in total sequence space.


[i] Estimating the Prevalence of Protein Sequences Adopting Functional Enzyme Folds, Douglas D. Axe. J. Mol. Biol. (2004) 341, 1295–1315

[ii] Estimating the Prevalence of Protein Sequences Adopting Functional Enzyme Folds, Douglas D. Axe. J. Mol. Biol. (2004) 341, 1295–1315

[iii] Functional proteins from a random-sequence library, Anthony D. Keefe and Jack W. Szostak. Nature (2001) 410, 715–718

[iv] Searching sequence space for protein catalysts. Sean V. Taylor, Kai U. Walter, Peter Kast, and Donald Hilvert. PNAS September 11, 2001 vol. 98. http://www.pnas.org/cgi/doi/10.1073/pnas.191159298

[v] Experimental Rugged Fitness Landscape in Protein Sequence Space. Yuuki Hayashi, Takuyo Aita, Hitoshi Toyota, Yuzuru Husimi, Itaru Urabe, Tetsuya Yomo. PLoS ONE. December 2006. Issue 1