The Book of Life - Mathematics of Life

Mathematics of Life (2011)

Chapter 8. The Book of Life

In 1990 the world’s geneticists embarked on the most ambitious programme of biological research the world had ever seen, which many compared in its scale to the Kennedy-era project to land a man on the Moon. Biologists were aiming to join the ranks of big science, previously occupied mainly by particle physicists, nuclear physicists and astronomers, where governments were willing to spend billions of dollars rather than mere millions. This financial aim was explicit, but the scientific objective was impeccable and important: to sequence the human genome – that is, to obtain the complete sequence of DNA bases in a typical human. It was known that there are about three billion of these, so the task would be difficult, expensive, but feasible. Just right for big science.

The project emerged from a series of workshops supported by the US Department of Energy, beginning in 1984 and leading to a report in 1987.1 This set the goal of sequencing the human genome, pointing out that this objective was ‘as necessary to the continuing progress of medicine and other health sciences as knowledge of human anatomy has been for the present state of medicine’. In the popular media, biologists were in search of the Book of Life.

In 1990 the Department of Energy and the National Institutes of Health announced a $3 billion project – one dollar per base pair. Several other countries joined the USA to create a consortium: Japan, the United Kingdom, Germany, France, China and India. At that time, finding even short DNA sequences was time-consuming and laborious, and it was estimated that the project would take fifteen years. This estimate was not far out, despite huge improvements in sequencing technology. However, a latecomer to the game demonstrated that the whole project could have been completed in about three years, for one-tenth of the cost, by putting more effort into thinking and less into complicated biochemistry. In 1998 Craig Venter, formerly a researcher at the National Institutes of Health, founded his own company, Celera Genomics, and set out to derive the entire sequence independently using $300 million contributed by private investors.

The publicly funded Human Genome Project (HGP) published its new data on a daily basis, and announced that all its results would be freely available. Celera published its data annually, and announced its intention to patent some of it – a few hundred genes. In the event, Celera filed preliminary patent applications on 6,500 whole or partial genes. These intellectual property rights were what, with luck, would repay investors. A corollary was that Celera’s data would not be free for all researchers to use; the company’s initial agreement to share data with HGP broke down when Celera declined to lodge its data in the publicly accessible GenBank database. But Celera used HGP’s data as part of its own effort – well, it was public.

To protect the free release of vital scientific data, HGP took steps to publish its data first, which would (subject to legal wranglings) constitute ‘prior art’ and invalidate Celera’s patents. In the event, HGP published the ‘final’ sequence a few days before Celera. By then, President Bill Clinton had already stated that he would not permit the genome sequence to be patented, and Celera’s market value plummeted. The NASDAQ stock exchange, overloaded with biotechnology companies, lost tens of billions of dollars.2

In 2000 Bill Clinton and Tony Blair announced to the world that a ‘draft genome’ had been obtained. The next year, both HGP and Celera published drafts which were about 80% complete. An ‘essentially complete’ genome was announced by both groups in 2003, though there were disagreements about what this phrase meant; further improvements in the period up to 2005 led to a sequence that was about 92% complete. The main stages of the programme were to sequence complete chromosomes – recall that we humans have 23 pairs of these. The sequence for the final human chromosome was published in Nature in 2006.

By 2010 most of the gaps in the sequence had been filled, although a significant number remain. So do numerous errors. A dedicated group of geneticists has undertaken the task of filling the gaps, and eliminating the errors in supposedly ‘known’ regions. This is scientifically essential, but they will get little credit for it because the exciting frontier has moved on. Their devotion to science is admirable.

How do you ‘read’ the sequence of such a huge molecule as DNA? Not by starting at one end and proceeding along it to the other.3 Current sequencing techniques do not work on gigantic molecules: they typically use stretches of 300 – 1,000 DNA bases. The way round this restriction is obvious, though not its implementation: break the molecule into short fragments, sequence them, then stick them all back together in the right order.

The first effective sequencing method goes back to Allan Maxam and Walter Gilbert in 1976. Their idea was to change the structure of the DNA molecule at specific bases, and add a radioactive ‘label’ at one end of each fragment. Four different chemical processes then targeted the four types of base. It would have been neat and tidy if these processes could cut the strand at A, C, G and T respectively, but the chemistry didn’t work out like that. Instead, two processes created cuts at specific bases, C and G, while the others had a little ambiguity: they created cuts at either of two distinct bases: A or G, and C or T. However, if you knew the ‘A or G’ data and the G data, you could deduce which of the ‘A or G’ cuts were A and which were G, and similarly for the ‘C or T’ cuts. Knowing which type of base is located at each cut, and using the radioactive label to sort the fragments into order of length by letting them migrate through a sheet of gel, you could deduce the sequence of bases. This method is called gel electrophoresis, because it passes an electric current through the gel to make the molecules move through it.
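The deduction step is just set subtraction: any cut that shows up in the ‘A or G’ reaction but not in the ‘G’ reaction must be an A, and likewise for ‘C or T’ versus ‘C’. A minimal sketch in Python, using invented cut positions purely for illustration, looks like this:

```python
# Hypothetical cut positions reported by the four Maxam-Gilbert reactions.
# The numbers are made up for illustration; real data would come from the gel.
a_or_g = {2, 5, 9}      # cuts at 'A or G'
g_only = {5}            # cuts at 'G'
c_or_t = {1, 4, 7}      # cuts at 'C or T'
c_only = {4}            # cuts at 'C'

bases = {}
for pos in g_only:
    bases[pos] = "G"
for pos in a_or_g - g_only:     # 'A or G' but not 'G' must be 'A'
    bases[pos] = "A"
for pos in c_only:
    bases[pos] = "C"
for pos in c_or_t - c_only:     # 'C or T' but not 'C' must be 'T'
    bases[pos] = "T"

print("".join(bases[p] for p in sorted(bases)))   # TACGTA
```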

The next advance was the chain-terminator method, also called the Sanger method after its inventor, Frederick Sanger. This procedure also creates fragments of varying lengths, which are similarly passed through a gel to sort them into order. The cunning step is to attach fluorescent dyes to the molecular labels – green for A, blue for C, yellow for G, red for T – and to read these automatically using optical methods.

This technique works well for relatively short strands of DNA, up to about a thousand bases, but it becomes inaccurate for longer strands. Numerous technical variations on the chain-terminator method have been devised to streamline the process and speed it up. Statistical methods have been developed to improve accuracy, in cases where the patch of fluorescent dye is a bit faint or fuzzy. Automated DNA sequencers can handle 384 DNA samples in a single run, and can carry out about one run per hour. So in a day, you can sequence about 9,000 strands – a little under 10 million bases per sequencer per day. As experience grows and demand for new sequences increases, these numbers are rising rapidly.
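The arithmetic behind those figures is easy to check; a back-of-envelope sketch, assuming one run per hour around the clock and reads of about a thousand bases each:

```python
samples_per_run = 384      # DNA samples handled in a single run
runs_per_day = 24          # about one run per hour
bases_per_strand = 1000    # upper end of the useful read length

strands_per_day = samples_per_run * runs_per_day
print(strands_per_day)                       # 9216 -- 'about 9,000 strands'
print(strands_per_day * bases_per_strand)    # 9216000 -- a little under 10 million bases
```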

At this stage there is a trade-off between two aspects of the problem: creating the break points and sequencing the resulting fragments is a matter of biochemistry, while reassembling the pieces in the correct order is a matter of mathematics. The cleverer you are when you break the molecule up, the easier it will be to reassemble the pieces. If you cut the chain at ‘known’ locations, and keep track of which pieces are adjacent to them, then reassembly is in principle straightforward. It’s like marking the corresponding ends of the fragments with labels that match.

If convenient break points always existed, this method would be very effective. But often they don’t, and then an alternative is required. Both the HGP and Celera used the ‘shotgun’ method. This breaks the strand into random fragments. Each fragment is sequenced, then they are fitted together by mathematical techniques, implemented on fast computers. The method works because many copies of the same strand are broken up, each in different random places, so the fragments overlap: the same bit of DNA can appear at the end of one fragment and the start of another. Those overlaps tell you how the pieces fit together.

In oversimplified terms, it works like this. Suppose you have two pieces, say

ACGTAC    GGTTCA

and you know that they abut, but not in which order. Then you have to decide whether the correct join is

ACGTACGGTTCA

or

GGTTCAACGTAC

If you have an overlapping fragment that reads

TTCAACG

then it fits the second possibility

GGTTCAACGTAC

but not the first. In practice, you have a lot of these fragments, and a variety of information about possible ways to reorder them. And the fragments are much longer – but that helps more than it hinders, because the overlaps can be larger, hence less ambiguous.

Again, there are lots of different ways to carry out this strategy for sequencing long DNA strands. All of them rely heavily on computers to do the mathematical calculations and to handle the large quantities of data, but a lot of mathematical thought has to go into sorting out what the computers are instructed to do. A simple example of such methods is the so-called greedy algorithm. Given a collection of fragments, many of them overlapping, first find the pair of fragments with the biggest overlap. Merge them into a single chain and replace them by this new chain. Now repeat. Eventually, many of the fragments will be merged into a single chain; with enough overlaps, all of them. This method does not always lead to the shortest chain that is consistent with all the fragments, and it may not produce the correct assembly. It is also computationally inefficient because at each stage you have to calculate all the overlap sizes for all the pairs of fragments.

As a simple example with much smaller numbers, suppose the fragments are

CCCCTTAA    TTAAGCGC    GCTTTAAA    and a fourth fragment

The biggest overlap occurs with CCCCTTAA and TTAAGCGC, which therefore merge to give CCCCTTAAGCGC, and that replaces the first two sequences on this list. The biggest overlap among what’s left is with CCCCTTAAGCGC and GCTTTAAA, which merge to give CCCCTTAAGCGCTTTAAA. The fourth sequence on the list doesn’t overlap this, so it has to be left unconnected until further data are obtained.
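For readers who like to see the machinery, here is a minimal Python sketch of the greedy algorithm just described, run on the three fragments named above (the unspecified fourth fragment is left out). It carries out the same merges and ends with the chain CCCCTTAAGCGCTTTAAA.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the biggest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        size, a, b = max(
            (overlap(a, b), a, b)
            for a in frags for b in frags if a != b
        )
        if size == 0:
            break                      # no remaining pair overlaps at all
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[size:])     # glue the pair into a single chain
    return frags

print(greedy_assemble(["CCCCTTAA", "TTAAGCGC", "GCTTTAAA"]))
# ['CCCCTTAAGCGCTTTAAA']
```

The overlap check here is brute force, recomputed for every pair at every stage – exactly the inefficiency mentioned above; real assemblers use far cleverer data structures.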

The HGP put most of its money on the biochemical step, and first broke up the genome into sequences of about 150,000 bases by cutting it at specific locations. This involves a lot of effort finding enzymes that make suitable cuts, and sometimes these prove elusive. Then each of these fragments was sequenced by the shotgun method. Celera put all of its money on the mathematical step, and applied the shotgun to the entire human genome. Then it used a large number of sequencing machines to find the DNA sequences of the fragments, and assembled them by computer.

You can either use clever chemistry to simplify the maths, or use clever maths to simplify the chemistry. HGP took the first approach, Celera the second. The second turned out to be cheaper and faster, mainly thanks to the enormous power of modern computers and the development of slick mathematical methods. The wisdom of this choice was not entirely obvious at first, because Celera used data from the HGP in its assembly process. But as more and more genomes have been sequenced, it has become clear that whole-genome shotgun is the way to go, at least until something better comes along.

Today, sequencing genomes has become almost routine. Scarcely a week passes without an announcement that a new organism has been sequenced – over 180 species to date. Most are bacteria, but they also include the mosquito that transmits malaria to humans, the honeybee, the dog, the chicken, the mouse, the chimpanzee, the rat and the Japanese spotted green pufferfish. As I write, the latest is a sponge, whose sequence may shed light on the origin of multicellular animals.

In Jurassic Park, dinosaurs were brought back to life by sequencing their DNA, extracted from blood that had been ingested by blood-sucking flies which were then preserved in amber. This fictional technique doesn’t work in reality, because ancient DNA degrades too fast, but over the past few years something similar has been done with DNA that is tens of thousands of years old. In particular, we now have a growing understanding of the Neanderthal genome. Neanderthals, of course, were a rather robust form of hominid that coexisted with early modern humans, between about 130,000 and 30,000 years ago. Until recently taxonomists disagreed about their status: some considered them to be a separate species, Homo neanderthalensis, while others classified them as a subspecies, H. sapiens neanderthalensis, of H. sapiens. It is now known that many people alive today – those of non-African ancestry – carry DNA sequences derived from the Neanderthal genome, amounting to up to about 4% of their DNA, transmitted via a Neanderthal male and a modern-human female. So DNA supports the subspecies classification.

The word ‘gene’ is bandied about as if everyone knows what it means. Genes make you what you are. They explain everything about you. Genes make you fat, they make you homosexual, they cause diseases, they control your destiny.

Genes are magic. Genes perform miracles.

It is worth distinguishing two uses of the word ‘gene’. One is very limited: a gene is a portion of the genome (not necessarily in one connected piece) that codes for one or more proteins. Not so long ago the conventional wisdom would have deleted ‘or more’, but the Human Genome Project revealed that although we have 100,000 different proteins in our bodies, they are specified by only 25,000 genes. Genes often come in several pieces, and the amino acid sequences that these pieces specify can be spliced together in many different ways. So the same gene can, and often does, code for several different proteins.

The second usage of ‘gene’ is extraordinarily broad. It arises from the activities of neo-Darwinists, who reinterpreted Darwinian evolution in terms of DNA. This approach has enormous scientific value, but some of the interpretations associated with it are questionable; and unfortunately these interpretations have become common currency, while only specialists understand the underlying science. In his elegant masterpiece The Blind Watchmaker, Richard Dawkins defined the phrase ‘a gene for X’ to mean ‘any kind of genetic variation that affects X’. Here X is any feature of an organism; Dawkins’ example is ‘tying shoelaces’.

This definition is just about defensible, though it runs into trouble when X is ‘having a blue-eyed mother’. In practice, ‘affects X’ is interpreted as ‘changes to the gene correlate with changes to X’, because cause and effect are often hard to establish. The children of blue-eyed mothers do indeed exhibit genetic variation that correlates with having a blue-eyed mother, so in that sense they ‘have a gene for having a blue-eyed mother’. But in the stricter sense the gene that matters is actually in the mother; the children show genetic variation because they sometimes inherit that gene.

Even ignoring such examples, the second, abstract definition of ‘gene’ can cause problems if it is confused with the first, concrete definition. Predictably, that’s exactly what has happened. Many people now assume that our behavioural quirks and predisposition to various diseases can be traced to specific DNA sequences in our genetic make-up. Newspaper reports of geneticists finding ‘the gene for’ something or other can lead us to believe that the something or other in question is somehow written into our genes – or at least into the genes of some of us. There are genes – we are told – for blue eyes, cystic fibrosis, obesity, novelty-seeking, susceptibility to heroin addiction, dyslexia, schizophrenia and emotional sensitivity. Analyses of identical twins separated at birth suggest that your genes may even determine what kind of person you marry and what make of car you buy.

Somewhere in my genes, apparently, it says ‘Toyota’. I find this very curious, because according to my birth certificate my genes existed in 1945, but Toyota made no significant exports to the UK until the 1970s.

Alleged ‘genes for’ schizophrenia, alcoholism and aggression have been announced, to a flourish of trumpets, and then quietly withdrawn when subsequent evidence fails to back up the initial assertion. The location of a gene for breast cancer has been claimed several times, not always correctly. Biotechnology companies have fought in court over patent rights to genes that are thought to increase the risk of contracting various diseases.

In 1999 the Guardian newspaper printed an article with the headline ‘“Gay gene” theory fails blood test’.4 This story began in 1993, when a segment of the human X chromosome known as Xq28, inherited from the mother, was implicated in male homosexuality. The initial evidence came from a study of gay male twins and brothers carried out by Dean Hamer and others, which concluded that gay men tend to have more gay relatives, on the maternal side, than heterosexual men do.5 Later, various researchers found that in 40 pairs of gay brothers, the genetic similarities in the Xq28 region were significantly greater than would be expected by chance. This finding created a global media sensation, and the ‘gay gene’ seemed to have been given a sound scientific basis, even though no scientist ever claimed to have pinned anything down to a single gene.

The fateful chromosome segment Xq28 played a central role in Hamer’s book Living with Our Genes, but even before it appeared, serious doubts were surfacing. In particular, other researchers couldn’t replicate Hamer’s results. In 1999 the journal Science carried an article by George Rice and colleagues, who examined blood from 52 pairs of brothers and attempted to confirm the link between Xq28 and homosexuality. They reported: ‘Our data do not support the presence of a gene of large effect influencing sexual orientation at position Xq28.’6

This negative conclusion remains in force today. In fact, the role of individual genes in determining large-scale human characters – those we encounter on a human level – seems to be very small. Leaving aside a few direct connections, such as hair and eye colour, the link between any specific gene and a human-level character is virtually non-existent. As evidence, consider height. There is little doubt that people’s genes play a major role in determining their height: tall parents tend to have tall children. So it is no surprise that, to date, height is the character that has been found to be most closely correlated with the presence or absence of a single gene (again excepting hair colour and the like). What is a surprise, however, is the extent to which this particular gene affects height. It accounts for an astonishing . . . two per cent of the variation in human height.

And that’s the biggest correlation between a single gene and a human character.

How can two competent studies, both using similar methods, lead to such contradictory results? I’m not suggesting that the scientists concerned acted in any way improperly. But there is a mechanism that can easily lead to these kinds of contradictory outcomes, even though the experiments have been performed honestly and competently. It comes from a subtle misinterpretation of statistics.

Statistical methods are used to assess correlations between two data sets. For instance, heart disease and obesity in humans tend to be associated. The degree of correlation can be calculated mathematically; its statistical significance is a measure of how likely it is for such a correlation to have arisen by pure chance. If, for instance, that level of correlation occurs 1% of the time in randomly chosen data sets, then the correlation is said to be significant at the 99% level.

The widespread availability of computer software has rendered virtually effortless a procedure that not long ago required days of work on a desktop calculating device – what we might call the ‘scattershot’ approach to finding significant correlations between genes and characters. Suppose you start out with a list of genes (or DNA segments or regions of the genome) and a list of characters in a sample of people. You now draw up a big table, called a correlation matrix, to find the most significant associations. How often is liver disease associated with the gene Visigoth? How often is being good at football associated with BentSquirrel5? (I’m making up the gene names . . . I hope.) Having done so, you pick the strongest association you can find, and run the relevant data through a statistics package to find out how significant it is. You then declare this association to be statistically significant at the level you have calculated, and publish that particular result, while ignoring all the other pairs of variables that you looked at.

What’s wrong? Why does the next study fail to find any such association? Why would we expect no confirmation? Because you chose a pair of data sets that were unusually closely correlated. You then, in effect, pretended you’d bumped into it at random. It’s like sorting through the pack to find the ace of spades, slapping it on the table, and claiming to have achieved a feat with probability 1/52.

Suppose you are looking at 10 genes and 10 characters. That gives 100 pairs. Of those 100 cross-correlations, given random variation, you expect one – on average – to be ‘significant at the 99% level’ – even if there is no causal connection whatsoever. (Actually, those 100 events won’t be completely independent. A similar criticism holds if that is taken into account, but the mathematics is less transparent.) If you now use the significance criterion to reject the other 99 pairs, and keep the significance level that the package gives you, there you have the fallacy. Not surprisingly, the next independent trial finds no significant association at all. It was never there.
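It is easy to watch this fallacy happen. The sketch below (Python, using numpy and scipy, with entirely made-up ‘genes’ and ‘characters’ drawn as pure noise) scans all 100 pairs, keeps the most significant-looking one, and then tries to replicate that single pair in a fresh, independent sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_people, n_genes, n_traits = 100, 10, 10

# Pure noise: there is no real connection between any 'gene' and any 'trait'.
genes  = rng.normal(size=(n_people, n_genes))
traits = rng.normal(size=(n_people, n_traits))

def p_value(x, y):
    """p-value of the Pearson correlation between two samples."""
    r, p = stats.pearsonr(x, y)
    return p

# Scattershot step: scan all 100 pairs and keep the most 'significant' one.
p_best, i, j = min(
    (p_value(genes[:, i], traits[:, j]), i, j)
    for i in range(n_genes) for j in range(n_traits)
)
print(f"best of 100 pairs: gene {i} vs trait {j}, p = {p_best:.3f}")  # often below 0.01

# Honest step: test that one pre-chosen pair in an independent sample.
genes2  = rng.normal(size=(n_people, n_genes))
traits2 = rng.normal(size=(n_people, n_traits))
print(f"replication: p = {p_value(genes2[:, i], traits2[:, j]):.3f}")  # usually unremarkable
```

On a typical run the ‘best’ pair looks impressively significant and the replication finds nothing – even though, by construction, there was never anything to find.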

The correct methodology should be to use one group of subjects to home in on a possible connection, but then to check it using a second, independent group (ignoring all data from the first trial and looking only at associations you’ve already chosen via the first trial). Often, however, the first study published in a journal and announced to the media carries out only the first step. Eventually a different team carries out the second step . . . and, surprise surprise, the result can’t be replicated. Unfortunately, it may take quite a while for the second step to be performed, and the mistaken claim to be corrected, because there is little scientific kudos to be gained by repeating other people’s experiments.

The three billion DNA bases in ‘the’ human genome may seem a lot, but in terms of computer storage, it constitutes a mere 825 megabytes of raw data. This is much the same as one music CD. So we are roughly as complex as Sergeant Pepper’s Lonely Hearts Club Band.
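The arithmetic behind that comparison is straightforward if you assume the crudest possible encoding – two bits per base, since there are only four letters; the base count is the only other assumption.

```python
def genome_megabytes(n_bases, bits_per_base=2):
    """Raw size of a genome stored at 2 bits per base (A, C, G, T)."""
    return n_bases * bits_per_base / 8 / 1e6

print(genome_megabytes(3.0e9))   # 750.0 MB for exactly three billion bases
print(genome_megabytes(3.3e9))   # 825.0 MB -- the figure quoted above
```

Either way, the total is roughly the capacity of a single CD.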

Because the information content of the human genome is so small, it is now possible to sequence the genome of an individual at a cost of between $5,000 and $15,000, predicted to drop to $1,000 within a couple of years. (The precise cost depends on how much of the genome is involved and other factors, including why the sequencing is being done and who is doing it.) This ‘personal genomics’ leads into an aspect of the Human Genome Project that was somewhat neglected in the heady rush to obtain the sequence. There is no such thing as the human genome. Different individuals have different alleles (gene variants) at particular genetic locations – affecting characters such as hair colour, eye colour and blood type, among many others – and also differ in parts of the genome that do not code for proteins, such as the so-called variable number tandem repeats, where the same DNA sequence is repeated over and over again. In fact, this is the basis of genetic fingerprinting, which was introduced by forensic scientists as a way to associate DNA traces with their owners. It wouldn’t work if all humans had the same genome.

However, we all have much the same basic framework for our DNA, and this is what the Human Genome Project was actually about. As it turned out, Celera’s genome really was personal: it was partly based on its founder Craig Venter’s own DNA.

In the heady days when the Human Genome Project was first seeking funding, the idea was sold to governments and private investors not as a vital piece of basic science, but as something that would inevitably lead to massive advances in our ability to cure diseases. Once you know ‘the information’ that makes a human being, surely you know everything about that being. Well no, because you’re confusing two different meanings of ‘information’ – what is encoded in DNA, and what you would need to know to put a human being together from scratch. Similarly, the telephone directory gives you ‘the information’ you need to get in touch with someone, but you also need a telephone, and you’re out of luck if they’re away on holiday. (Less so now that we have mobile phones, but you get the point.)

To date, the pay-off from the Human Genome Project, in terms of curing diseases, has been virtually non-existent. This is really no great surprise. For instance, the genetic basis of cystic fibrosis is the gene CFTR. It contains about 250,000 base pairs, but the protein that it encodes (cystic fibrosis transmembrane conductance regulator) is a chain of 1,480 amino acids. Get this chain wrong, and the protein doesn’t work. In 70% of the mutations seen in people with cystic fibrosis, three specific base pairs go missing. This triplet codes for phenylalanine, at position 508 of the protein. Its omission causes cystic fibrosis. The remaining 30% of cystic fibrosis patients have, between them, about a thousand different mutations of CFTR.
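The point about the three missing base pairs is that the deletion keeps the reading frame intact: removing three consecutive bases deletes a single amino acid – here a phenylalanine – while leaving the rest of the chain unchanged. A toy sketch (the sequence below is invented, not the real CFTR gene) shows the principle:

```python
# A deliberately tiny codon table -- just enough for the toy sequence below.
CODONS = {"ATG": "Met", "AAA": "Lys", "TTT": "Phe", "GGC": "Gly",
          "GAA": "Glu", "TAA": "STOP"}

def translate(dna):
    """Translate a DNA string codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODONS.get(dna[i:i+3], "?")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

normal = "ATGAAATTTGGCGAATAA"       # Met-Lys-Phe-Gly-Glu, then stop
mutant = normal[:6] + normal[9:]    # delete three consecutive bases

print(translate(normal))   # ['Met', 'Lys', 'Phe', 'Gly', 'Glu']
print(translate(mutant))   # ['Met', 'Lys', 'Gly', 'Glu'] -- one Phe missing
```

In the real protein, losing that one phenylalanine at position 508 is enough to stop it working.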

Most of this has been known since 1989, but no cure for cystic fibrosis has yet been discovered. Gene therapy, a technique for changing the DNA in the cells of a living human by infecting them with a virus that carries the required sequence, has run into serious trouble after the deaths of several patients. Some forms of this treatment are currently illegal in various countries; however, the technique has had limited success in the treatment of X-linked severe combined immunodeficiency, popularly called bubble-boy syndrome because sufferers have to be isolated from those around them to avoid severe infections.

There is a growing realisation that, with a few standard exceptions, our genes do not cause – or even predict – the diseases that we will contract throughout our lives. The US Government is now taking urgent steps to regulate the activities of personal genomics companies, to prevent the exploitation of inaccurate public perceptions of genes.

As basic science, the Human Genome Project constitutes a huge breakthrough. As a major advance in medicine, it has yet to perform. Even as basic science, its main outcome has been to force major revisions of biologists’ previous assumptions about human genetics. I’ve already mentioned that before the human genome was sequenced, it was believed that there must be about 100,000 genes, in the sense of ‘sequences that code for proteins’. The reason was straightforward: the human body has about 100,000 different proteins. As already remarked, it turned out that only 25,000 or so such genes exist. We then learned that genes are split into separate segments, which can be combined in many different ways, so the same gene can code for several proteins. The idea that an individual’s DNA sequence is some kind of dictionary of its proteins turns out to be naive and simplistic.

All this makes the Human Genome Project excellent science: it changes our views. Unfortunately, the resulting picture has turned out to be more complicated than biologists had expected, and it is becoming clear that the gap between sequencing an organism’s DNA and knowing how that organism works is far greater than most people had hoped.