Bioinformatics: Genome Assembly

Assembly: Solving Really Big Puzzles

One of the primary duties of a bioinformatician is to combine little pieces of DNA into bigger pieces. When scientists sequence the genome of a species, it doesn't come out of a machine in one magical lump. Sequencing machines (which read DNA sequences) produce lots of little sequences of DNA, called reads (strings of A's, T's, G's, and C's), each 50-700 base pairs (bps) long. They spit out millions of them. The challenge for bioinformatics is to assemble those millions of short reads into the full sequence of the genome. Imagine shredding a textbook and putting the pieces back together. This process is called (no surprise here) assembly, since we're assembling pieces of DNA into a larger sequence. Assembly really made a splash in 2003 when the human genome was sequenced ...

More Than One Way to Skin a Genome

The Human Genome Project was really a race between two projects, both attempting to assemble the entire human genetic sequence. The older, classical method used by the government's team worked like this:
  • Get DNA from a person
  • Break it into pieces
  • Clone pieces into plasmids
  • Put plasmids in bacteria
  • Isolate individual bacterial colonies (each colony has one piece of human DNA on a plasmid)
  • Sequence each plasmid from each isolated colony
  • Put pieces together
[Image: colony picker robot.]
This method was highly accurate, but very, very slow. Each bacterial colony had to be picked by a robot, then stored in refrigerators and accessed later by another robot. Sequencing each individual colony took a lot of time and a lot of money.

Shotgun Sequencing

"He puzzled and puzzled 'til his puzzler was sore ..." - Dr. Seuss


A private company led by Craig Venter joined the race in the late 1990s. Their big idea, primarily devised by bioinformatician Eugene Myers, was to [1]:
  • Get DNA from a person
  • Break it into pieces
  • Sequence all the pieces at once
  • Put them back together using a complicated computer program (the first assembler)
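
To get a feel for the "break it into pieces" step, here is a minimal Python sketch that shreds a toy sequence into short reads. It is only an illustration, not how a real sequencer works: the function name `shotgun_reads`, the fixed read length, and the uniform random start positions are all assumptions made for this example, and real reads also contain errors.

```python
import random


def shotgun_reads(genome, n_reads, read_len, seed=0):
    """Sample n_reads substrings of length read_len from random positions.

    Hypothetical helper for illustration: every read is error-free and
    has the same length, which real sequencing data does not guarantee.
    """
    if read_len > len(genome):
        raise ValueError("read_len is longer than the genome")
    rng = random.Random(seed)          # seeded so the run is repeatable
    last_start = len(genome) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)   # randint is inclusive on both ends
        reads.append(genome[start:start + read_len])
    return reads


# Toy "genome": the combined sequence from the example in this post.
toy_genome = "CGAATCGTCGCAATTCGTCGTCGCTCG"
reads = shotgun_reads(toy_genome, n_reads=12, read_len=10)
for read in reads:
    print(read)
```

Notice that the reads come out in no particular order and with no record of where they came from. Putting them back in order is the whole job of the assembler.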


[Figure reproduced from reference 1.]
At the time, sequencing everything at once and putting it back together was thought to be impossible. It certainly couldn't be done by hand; it had to be done by a computer. Myers' program worked by comparing every piece to every other piece. Whenever two pieces shared enough overlapping sequence, they were combined.
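
The "compare every piece to every other piece" idea can be sketched in a few lines of Python. This is a simplified illustration, not the actual program from reference 1: the names `overlap_len` and `all_pair_overlaps` are made up here, and only exact matches between the end of one piece and the start of another are counted.

```python
def overlap_len(a, b, min_overlap=1):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_overlap bases exists.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_pair_overlaps(reads, min_overlap):
    """Compare every read against every other read, in both orders."""
    found = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap_len(a, b, min_overlap)
                if k:
                    found.append((i, j, k))   # end of read i matches start of read j
    return found


# The two example sequences used later in this post.
reads = ["AATTCGTCGTCGCTCG", "CGAATCGTCGCAATTC"]
pairs = all_pair_overlaps(reads, min_overlap=3)
print(pairs)   # [(1, 0, 5)]: the last 5 bases of read 1 match the first 5 of read 0
```

The two nested loops are the expensive part: every read is checked against every other read, so the work grows with the square of the number of reads.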

Example Assembly

Check out this example:

Sequence 1: AATTCGTCGTCGCTCG
Sequence 2: CGAATCGTCGCAATTC

These sequences overlap, like so:
       CGAATCGTCGCAATTC
                  AATTCGTCGTCGCTCG

and can be combined into a single sequence:
CGAATCGTCGCAATTCGTCGTCGCTCG
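
Here is the same merge as a small Python sketch, assuming exact matches only. The function name `merge_on_overlap` and its `min_overlap` parameter are hypothetical, chosen just for this example.

```python
def merge_on_overlap(left, right, min_overlap=1):
    """Join left and right if the end of left matches the start of right.

    Tries the longest possible overlap first. Returns the merged string,
    or None if no overlap of at least min_overlap bases is found.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]   # keep the shared bases only once
    return None


seq1 = "AATTCGTCGTCGCTCG"
seq2 = "CGAATCGTCGCAATTC"

# The end of Sequence 2 ("AATTC") matches the start of Sequence 1.
contig = merge_on_overlap(seq2, seq1, min_overlap=5)
print(contig)   # CGAATCGTCGCAATTCGTCGTCGCTCG
```

Order matters: with the same 5 bp minimum, `merge_on_overlap(seq1, seq2, min_overlap=5)` returns `None`, because the end of Sequence 1 does not match the start of Sequence 2 over that many bases.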

This was done over and over until many of the small pieces were swallowed up into bigger pieces (bigger sequences are called "contigs" in bioinformatese). Since two pieces could overlap by random chance, overlaps had to be big (bigger than the 5 bps in our example) to reduce the risk of joining two pieces by accident. Eugene Myers' team nicknamed contigs by size: small = rock, smaller = stone, smallest = pebble. By joining contigs one by one, the whole genome could be reconstructed.
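
The "over and over" loop can be sketched as a greedy merge: find the pair with the biggest overlap, join it, and repeat until nothing overlaps enough. This is a toy version written for this post, not the algorithm from reference 1. The names `overlap_len` and `greedy_assemble` are made up, matches must be exact, and repeats in a real genome can fool a greedy approach like this one.

```python
def overlap_len(a, b, min_overlap):
    """Longest suffix of a that equals a prefix of b (0 if below min_overlap)."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_overlap):
    """Repeatedly merge the pair of pieces with the largest overlap."""
    contigs = list(dict.fromkeys(reads))   # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_len(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs                 # nothing left that overlaps enough
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


# Three reads cut from the combined example sequence, given out of order.
reads = ["TCGTCGTCGCTCG", "CGAATCGTCGCA", "GTCGCAATTCGTCG"]
contigs = greedy_assemble(reads, min_overlap=5)
print(contigs)   # ['CGAATCGTCGCAATTCGTCGTCGCTCG']
```

The `min_overlap` setting is the "overlaps had to be big" rule in code form. Set it too low and short chance matches (like a shared "CG") start to look like real joins.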


The idea is simple enough, right? It's not the concepts that make this hard in real life; it's the sheer, overwhelming amount of data. The human genome is 3 BILLION bps long! Craig Venter's team needed a supercomputer to run the software to assemble the Human Genome. Assembly is now a commonplace part of biology, and there are many genomes much larger than 3 billion bps. No wonder CLC Bio has the saying: "Rocket Science is for kids, Bioinformatics is for scientists".
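
Some back-of-the-envelope Python shows why the data volume hurts, and why overlaps had to be big. The numbers plugged in are assumptions for illustration only: a 500 bp read length (inside the 50-700 bp range above), just enough reads to cover the genome once, and a crude model where every base is equally likely to be A, T, G, or C. The function names are made up for this example.

```python
import math


def reads_needed(genome_len, read_len, coverage):
    """Reads required so the total bases read equal coverage x genome_len."""
    return math.ceil(coverage * genome_len / read_len)


def ordered_pairs(n_reads):
    """Number of (read A, read B) comparisons when every read meets every other."""
    return n_reads * (n_reads - 1)


def expected_chance_overlaps(n_reads, k):
    """Rough count of pairs whose ends match over exactly k bases by luck.

    Assumes uniform random bases, so one k-base match has probability (1/4)**k.
    """
    return ordered_pairs(n_reads) * 0.25 ** k


n = reads_needed(genome_len=3_000_000_000, read_len=500, coverage=1)
print(n)                    # 6000000 reads just to cover the genome once
print(ordered_pairs(n))     # all-against-all comparisons

for k in (5, 20, 40):
    print(k, expected_chance_overlaps(n, k))
```

Under these assumptions a 5 bp overlap happens by accident an enormous number of times, while a 40 bp overlap essentially never does. Real genomes are full of repeats, so the real picture is worse than this simple model suggests.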


 For more juicy details about the people and science involved in the Human Genome Project, check out this book:

The Genome War: How Craig Venter Tried to Capture the Code of Life and Save the World 
by James Shreeve

1. The original paper where Eugene Myers describes his assembly algorithm is:

Myers EW, Sutton GG, Delcher AL, et al. (2000). A Whole-Genome Assembly of Drosophila. Science, 287:2196-2204.
