CS461/661: Bioinformatics Tools and Applications

Genome Sequencing

Prof. Bin Ma

Spring 2006

Credits:

This page contains material originally from the following websites:  http://nema.cap.ed.ac.uk/teaching/genomics/tech1/class1.html and was modified to be used at CS461/661 at UWO.


Practical Session

In this practical session you will be taking a set of sequence reads derived from a mitochondrial genome, and using one of the programs used by the big genome sequencing centres to "assemble" these sequence reads into a complete genome. The mitochondrial genome is circular (in animals) and ~14000 base pairs in length. It encodes 2 ribosomal RNAs and 12 or 13 proteins (depending on which species is being examined). It also encodes 20 transfer RNAs.


The Brugia malayi mitochondrion

a circular molecule, with 12 protein coding genes and two rRNAs, just under 13,600 base pairs in length.

The sequencing reads were generated by Jennifer Daub, a research scientist in ICAPB, University of Edinburgh, as part of the Brugia malayi genome project. See the Brugia malayi genome project (http://nema.cap.ed.ac.uk/fgp.html) home page for more information on this parasitic nematode.

Brugia malayi is a type of parasite that can cause a disease called Elephantiasis, currently affecting 120 million people in the world.

The reads are each 300-600 bases long, and have been pre-trimmed of vector and "low quality" sequence.

You will use the program "Cap" (for contig assembly program) which is available through the world wide web.


1 IN A NEW BROWSER WINDOW Open the data file reads.html (http://nema.cap.ed.ac.uk/teaching/genomics/tech1/class.html)

In the file are 40 sequences, derived from performing sequencing reads on clones.

Each clone is named "cloneX" where X is a number, and each sequence is labelled _F or _R depending on whether it was derived from using a primer directed to the left or forward end or the right or reverse end of the insert.

Thus "clone123_F" and "clone123_R" form a "read pair" from each end of a single clone.

Each of the 40 sequences is in "fasta" format.

Fasta format is a simple format for DNA and protein sequences. The first line is a ">" symbol, followed by a description of the sequence (a definition line) and then a carriage return/new line. The definition in this case includes the sequence name, eg "clone123_R.fa : TYPE = DNA". The next lines are the sequence, and the end of each sequence record is simply the next ">" symbol.


2 IN A NEW BROWSER WINDOW Open a link to the CAP3 analysis server at Pôle Bio-Informatique Lyonnais

Pôle Bio-Informatique Lyonnais is a Bioinformatics Server based in Lyon, France.


3 Select-all of the sequence data in reads.html from the first browser window and paste it into the window "Enter your sequences in FASTA format (no more than 50 kb):".

Select ONLY the fasta sequences and not the header...


4 Select "SUBMIT" by clicking on the button .

Your data will be sent to the remote computing facility which will carry out the alignment.

Depending on the level of other users on these computers, the assembly will take from 2-10 minutes.

Please be patient while this happens: DON'T keep clicking on the page while it is loading!


5 Examine the results of the assembly.


Contigs is the assembled sequences in fasta format

a "contig" is a set of contiguous sequences

How many contigs did you get?

Each line is 60 bases long: how much sequence in total has been assembled?

To assemble the sequences, the program first compares each individual sequence to all the others, remembering which had significant matches. It then attempts to align all the ones with matches together.

We have used the default parameters for the assembly, but it is possible to change these: for example, what length and %identity is considered a "match".

The parameters used were:

CAP Multiple Sequence Alignment Parameters
Remove bad 5'/3' regions : No 
Trim the Sequence/Cloning Vector : No 
Assembly Algorithm : Optimized for reads <= 450 bp

Single sequences is a list of the single sequences (called "singletons") that didn't assemble into contigs.

How many sequences didn't assemble?


Assembly details gives an overview of the way cap3 put the sequence files together.

 

In the table describing the contigs, the read name is given, followed by a "+" or a "-": these symbols indicate the orientation of the read in the contig.

Below the table are the alignments for each contig. Look at these.


6 Questions for you to answer....

a What did the program have to do?

If you were to write a program that performs the assembling, what procedures will you need?

How many sequence comparisons did it have to carry out?

What problem the longer repeats in genomes will cause to such an assembler? 

Why shorter repeats do not cause these problems?

b Draw maps of the sequence reads and their place in the contigs.

For example, here is a map of a short contig showing left and right reads with different coloured arrows.

I have aligned at the same level reads from the same clone.

You should add a scale in base pairs.

 

c Is the assembly a good one? Why did you get more than one contig?

Is the genome assembly complete?

What does the program do when there are "N" residues (undetermined bases) in the sequences?

What does the program do when there are discrepancies between reads?  (Hint: modify a read and redo the assembly and find out.)

Which areas of the genome are more trustworthy than the others?

d Completing this small genome...

Can you see any way to link the contigs, using either the sequence data you have, or the clone information?

Look at the clone names: each has both left and right reads (for some clones only one is available). The left and right reads from one clone should assemble with the correct relative orientation in the final genome.

The map above suggests that by finding read 4F I might be able to link this contig with another to the right, as 4R and 4F come from the same clone. Similarly, finding 9R and / or 23F would link the left end of this contig.

 

 

 


here is a precomputed set of cap results for these files....