Origins Enigma – Meditation 1

The origin of life is one of the grand mysteries of modern biology. A fairly recent realisation is that the ‘Origin’ might be one of many ‘origins’ which melded into the current ‘Life As We Know It’. However, I’m not looking at that question in this opening discussion. Examining the ‘complexity’ of Life and its origin process needs a starting point. Back in 1985/86 Robert Shapiro, an origin of life researcher, wrote an evocative high-level discussion of all the options as they seemed at the time. “Origins” is a brilliant book and well worth a look. Our knowledge has expanded immensely since then, with a multitude of bacterial and higher organisms now sequenced and their proteins mapped out.

Shapiro set the question out like so: imagine a vat full of bacteria. It’s a pressure cooker and the contents are raised to the temperature where they’re reduced to their atomic components. The vat is then cooled down. What are the odds that a bacterium will then reassemble out of the raw materials? If the number of atoms in a bacterium is about 10 billion, 10 to the power 10 (10¹⁰), then Shapiro estimates the odds to be super-exponential: roughly 1 in 10^{10^10}. Such numbers need some contemplation. The total number of atoms in the visible universe is usually estimated at about 10⁸⁰; round that up generously to 10¹⁰⁰, a googol, which in super-exponential form is 10^{10^2}. Even that already incredible figure has to be raised to the 100 millionth power (multiplied by itself 100 million times) to equal 10^{10^10}.
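The step from a googol to 10^{10^10} is easy to check if we work with exponents rather than the numbers themselves. The following is only a minimal sketch of that arithmetic; the variable names are mine, chosen for illustration.

```python
import math

# Work with log10 of each quantity, because 10^(10^10) itself is far too
# large to store as an ordinary number.
log10_googol = 100          # log10 of 10^100
log10_target = 10 ** 10     # log10 of 10^(10^10)

# (10^100)^k = 10^(100 * k), so the power k we need is the ratio of exponents.
power_needed = log10_target // log10_googol

print(power_needed)               # 100000000, i.e. 100 million
print(math.log10(log10_googol))   # 2.0, so a googol is 10^(10^2)
```

In double-exponent terms the googol sits at 10^{10^2} and the target at 10^{10^10}, which is why the gap between them is itself a number with eight zeros.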

Chemical thermodynamic probability is essentially a measure of the information required to specify a particular molecular configuration. The number of possible configurations grows exponentially with the number of bits: any particular binary sequence is one selection out of 2ⁿ, n being the number of bits. Thus a selection of 1 out of 10^{10^10} possible options is ln(10^{10^10})/ln(2) = 33.22 billion bits, where ‘ln’ is the natural logarithm. At this point it seems like an incredibly large amount of information is needed to define a bacterium. However, that figure has to be compared with the known amount of genetic information the average bacterium uses to reproduce itself.
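The conversion from "1 choice out of 10^{10^10}" to bits can be sketched in a few lines. This assumes the round figure of 10¹⁰ atoms used above, and it uses the identity log2(10^N) = N × log2(10) so that the huge number never has to be built.

```python
import math

atoms = 10 ** 10  # assumed round figure for atoms in a bacterium

# Each factor of 10 in the number of options costs log2(10) bits.
bits_per_factor_of_ten = math.log2(10)

# log2(10^atoms) = atoms * log2(10)
bits = atoms * bits_per_factor_of_ten

print(round(bits_per_factor_of_ten, 4))   # 3.3219
print(round(bits / 1e9, 2))               # 33.22 (billion bits)
```

The same factor of about 3.322 bits per power of ten is what turns any atom count in this discussion into ‘molecular bits’.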

For example, “Pelagibacter ubique” is the most common free-living bacterium in the oceans, with a genome that’s 1,308,759 base pairs in length. As each base pair is one of 4 options, that’s 2 bits per base pair for a total of 2,617,518 bits. There are about a billion atoms in each bacterium, thus about 3.322 billion ‘molecular bits’ specify its unique state. Yet clearly a genome doesn’t need to specify the position of every single atom in a bacterium. Instead the ‘units’ specified seem much bigger than mere atoms: 3.322E+9 / 2.617518E+6 ≈ 1,269. Genomes use sets of three base pairs (a codon) as the Genetic Code, which specifies 21 different options (20 amino acids plus ‘stop’), thus the actual amount of information per codon is about 4.4 bits, somewhat less than the maximum of 6 bits. Pelagibacter ubique’s genome is thus about 2 million bits of information, with each DNA bit specifying a unit that’s roughly 1,700 times fuzzier than the maximal atomic level precision. The number of options represented by 2 million bits is ~10^{10^6}, which is still an immense range of options.
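The Pelagibacter figures can be reproduced with a short calculation. This is a rough sketch under the same assumptions as the text: a round billion atoms per cell, and (as a simplification of mine) the whole genome treated as if it were read in codons.

```python
import math

base_pairs = 1_308_759   # genome length quoted above
atoms = 10 ** 9          # assumed round figure for atoms per cell

# Upper bound: 4 possible bases means 2 bits per base pair.
genome_bits_max = 2 * base_pairs

# 'Molecular bits' needed to pin down every atom, at log2(10) bits per
# power of ten.
molecular_bits = atoms * math.log2(10)

# How many molecular bits each genome bit has to stand in for.
fuzziness_max = molecular_bits / genome_bits_max

# A codon picks 1 of 21 meanings (20 amino acids plus 'stop'),
# so it carries log2(21) bits rather than the full 6.
bits_per_codon = math.log2(21)
codon_bits = (base_pairs / 3) * bits_per_codon
fuzziness_codon = molecular_bits / codon_bits

# Number of options in 2 million bits, expressed as a power of ten.
log10_options = 2_000_000 * math.log10(2)

print(genome_bits_max)              # 2617518
print(round(fuzziness_max))         # 1269
print(round(bits_per_codon, 2))     # 4.39
print(round(log10_options))         # 602060
```

So 2 million bits corresponds to about 10^{602,060} options, which is what the shorthand ~10^{10^6} rounds to in double-exponent form.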

All living things share in being ‘fuzzily’ specified on larger and larger scales. An average adult human body, massing about 72 kilograms of lean mass, is composed of 7,200 trillion trillion atoms (7.2E+27 atoms). Yet rather than needing to specify all 24 thousand trillion trillion molecular bits, the average human genome is only about 6 billion bits. That’s a ‘fuzziness’ of 4 million trillion. So how is it possible? Firstly a human, like any other large animal, begins as a single cell. That cell is then duplicated many trillion fold, with the cells’ collective interactions during development determining their place and role in the resulting body. In information science terms, multicellular organisms unpack themselves from highly compressed information, thanks to duplication and then variation of the basic duplicates.
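The human numbers follow the same recipe as the bacterial ones. A minimal sketch, taking the atom count and genome size quoted above as given:

```python
import math

atoms = 7.2e27        # atoms in the body, figure quoted above
genome_bits = 6e9     # approximate size of the human genome in bits

# Bits needed to specify every atom, at log2(10) bits per power of ten.
molecular_bits = atoms * math.log2(10)

# Ratio of 'molecular bits' to genome bits.
fuzziness = molecular_bits / genome_bits

print(f"{molecular_bits:.2e}")   # 2.39e+28, about 24 thousand trillion trillion
print(f"{fuzziness:.1e}")        # 4.0e+18, about 4 million trillion
```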

If every cell type were specified with every selection of proteins to be expressed over the organism’s lifetime, the total information to be encoded would be millions of times more than the ~billion bits in a string of average organismic DNA. Instead of a ‘recording’ of every protein to be made, the data for each protein is ‘recalled’ multiple times, millions of times over a lifetime.

More specifically for humans, we have roughly 20,000 genes, but about a million proteins are encoded by those genes. Most are produced as ‘splice variants’, and mutations in those variants are the source of a large part of human diversity. Human genomes differ, on average, by about 0.1%. That’s ~6 million bits of difference, again ~10^{10^6} possible options. In human history there have been about 100 billion people (10¹¹), thus ~10^{10^1} have existed out of the 10^{10^6} who *could* exist.
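The double-exponent shorthand hides some rounding, so it may help to see the exponents worked out. This sketch assumes the 6 billion bit genome and 0.1% difference quoted above; the helper name `double_exponent` is mine, not a standard function.

```python
import math

def double_exponent(log10_value):
    """Return x such that the number equals 10^(10^x), given its log10."""
    return math.log10(log10_value)

genome_bits = 6_000_000_000
diff_bits = genome_bits // 1000          # 0.1% of the genome, in bits

# 2^diff_bits possible variants, expressed as a power of ten.
log10_possible = diff_bits * math.log10(2)

# About 100 billion people have lived: 10^11, so log10 is 11.
log10_lived = 11

print(diff_bits)                                   # 6000000
print(round(log10_possible))                       # 1806180
print(round(double_exponent(log10_possible), 2))   # 6.26
print(round(double_exponent(log10_lived), 2))      # 1.04
```

In other words, roughly 10^{1.8 million} distinct genomes are possible at this level of variation, against the 10¹¹ that have actually been lived.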

When genomic sequences of different organisms show a large percentage of similarity, the idea that such similarity arose purely by random chance is dismissed (rightly) as absurdly unlikely. A far simpler, purely natural explanation is common ancestry. The resulting Molecular Family-Trees are ambiguous, however, because so many family-trees for any given gene sequence can be reconstructed from a given batch of data. This is true even within a species.