CS 1025 Computer Science Fundamentals I
Assignment 5

Department of Computer Science
University of Western Ontario
Given: November 11, 2007
Due: December 2, 2007

This assignment is out of 60 points.

Paper copies of your work must be handed in to the CS 025 locker (using the assignment submission form) and electronic copies should be submitted via the usual Computer Science Department electronic assignment submission system.

Please see the course outline for information on late penalties and the rules of ethical conduct.


Introduction

This assignment will analyze the genetic material of human mitochondrial DNA.

DNA and Bases

Human genetic material is carried in each cell in 23 chromosome pairs and in small bodies inside each cell called mitochondria. The genetic material in the chromosome pairs comes from both parents, while the genetic material in the mitochondria is generally thought to come only from the mother. The premise that mitochondrial DNA comes only from the mother is the basis of well-known DNA-based human migration studies.

Natural DNA is formed in oriented strands that can be described as sequences of simple molecules, known as "bases". The bases are cytosine, guanine, adenine and thymine. These are respectively abbreviated 'c', 'g', 'a' and 't'. Analysis of strings representing sequences of these bases is one of the activities in genomics.

DNA in cells has two strands wrapped around each other with the bases of one strand paired with the bases of the other, like a spiral staircase. The base 'g' is always paired with 'c' and 'a' is always paired with 't'.

Base       Abbreviation
Cytosine   c
Guanine    g
Adenine    a
Thymine    t

Proteins and Codons

DNA is used in the manufacture of proteins in cells. For our purposes, all we need to know is that proteins are made up of sequences of smaller molecules known as "amino acids". There are 20 different amino acids that can occur in human proteins. DNA encodes protein "blueprints" as sequences of (cgat) bases. It takes a sequence of three bases to encode an amino acid. For example, 'tgg' encodes tryptophan and 'ttt' encodes phenylalanine. A protein is then specified as a sequence of bases, viewed as a sequence of such triples.

A sequence always starts with a "Start" codon (which would normally encode either Isoleucine or Methionine, but in this special position always encodes Methionine). The sequence ends when a triple encoding "Stop" is encountered.

These triples are known as "codons". Below is a table showing which codons correspond to which amino acids in vertebrate mitochondria.

Amino Acid      Abbrev   Mitochondrial DNA codons
Alanine         A        gct gcc gca gcg
Arginine        R        cgt cgc cga cgg
Asparagine      N        aat aac
Aspartic acid   D        gat gac
Cysteine        C        tgt tgc
Glutamic acid   E        gaa gag
Glutamine       Q        caa cag
Glycine         G        ggt ggc gga ggg
Histidine       H        cat cac
Isoleucine      I        att atc
Leucine         L        ctt ctc cta ctg tta ttg
Lysine          K        aaa aag
Methionine      M        ata atg
Phenylalanine   F        ttt ttc
Proline         P        cct ccc cca ccg
Serine          S        tct tcc tca tcg agt agc
Threonine       T        act acc aca acg
Tryptophan      W        tga tgg
Tyrosine        Y        tat tac
Valine          V        gtt gtc gta gtg
Stop codons     *        aga agg taa tag

Note that there are 4×4×4 = 64 possible triples to represent 20 amino acids, so some amino acids are represented by more than one codon. (Why this is so is an interesting story.) There are also four "stop" codons, any one of which signals the end of a codon sequence.
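As a check on the counting argument, the rows of the table above can be loaded into a lookup structure and counted. The following is only a minimal sketch: the class name CodonTableSketch and the method build are made up for illustration and are not part of the classes you are asked to write.

```java
import java.util.HashMap;
import java.util.Map;

class CodonTableSketch {
    // One row per table entry: the one-letter abbreviation, then its codons.
    // Rows are copied from the vertebrate mitochondrial table above.
    static final String[] ROWS = {
        "A gct gcc gca gcg", "R cgt cgc cga cgg", "N aat aac",
        "D gat gac",         "C tgt tgc",         "E gaa gag",
        "Q caa cag",         "G ggt ggc gga ggg", "H cat cac",
        "I att atc",         "L ctt ctc cta ctg tta ttg",
        "K aaa aag",         "M ata atg",         "F ttt ttc",
        "P cct ccc cca ccg", "S tct tcc tca tcg agt agc",
        "T act acc aca acg", "W tga tgg",         "Y tat tac",
        "V gtt gtc gta gtg", "* aga agg taa tag"
    };

    // Build a map from codon (e.g. "tgg") to abbreviation (e.g. 'W').
    static Map<String, Character> build() {
        Map<String, Character> table = new HashMap<>();
        for (String row : ROWS) {
            String[] parts = row.split(" ");
            char acid = parts[0].charAt(0);
            for (int i = 1; i < parts.length; i++) {
                table.put(parts[i], acid);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, Character> table = build();
        System.out.println(table.size());     // prints 64
        System.out.println(table.get("tgg")); // prints W
    }
}
```

Every one of the 64 triples appears exactly once in the table, so the map has 64 entries: 60 codons for the 20 amino acids plus the stop codons marked '*'.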

The Human Mitochondrion

The complete sequence of a human mitochondrion is available from the US National Center for Biotechnology Information at the URL http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?list_uids=17981852. (In case this site becomes inaccessible, a backup link is given at this link.) A table of what the different ranges mean is given in the body of the page and the complete genome consisting of 16571 base pairs is given at the end of the page.

Instructions

For each of the questions below provide

Question 1. DNA representation (20 points)

Write a class DNASequence to represent a strand of DNA as a sequence of bases.

Your class should have a constructor that takes a String containing a sequence of the letters 'c', 'g', 'a', 't' and other characters. You must construct a new string containing only the letters 'c', 'g', 'a', 't' appearing in the argument string, ignoring all other characters. (We assume that the other characters are there for human readers, e.g. spacing, offset numbers, etc.)
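One possible way to do the filtering is sketched below as a stand-alone helper. The names BaseFilterSketch and keepBases are hypothetical, and the sketch assumes that only lower-case letters count as bases, as in the description above. How you organize this inside your own constructor is up to you.

```java
class BaseFilterSketch {
    // Return a new string holding only the characters 'c', 'g', 'a', 't'
    // of the argument, in their original order. Everything else (spaces,
    // digits, newlines, upper-case letters) is skipped.
    static String keepBases(String raw) {
        StringBuilder bases = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char ch = raw.charAt(i);
            if (ch == 'c' || ch == 'g' || ch == 'a' || ch == 't') {
                bases.append(ch);
            }
        }
        return bases.toString();
    }

    public static void main(String[] args) {
        // A made-up line in the style of the genome listing.
        System.out.println(keepBases("        1 gatcac aggt\n")); // prints gatcacaggt
    }
}
```

A StringBuilder is used so that each kept character is appended in constant time rather than building a new String on every step.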

Your class should also have methods

     int    baseLength();     // number of bases in the strand.
     char   baseAt(int i);    // the base at position i, i = 0..baseLength()-1.
     String baseString();     // cgat string.
that can be used to access the individual bases. The character returned by the baseAt method should be one of 'c', 'g', 'a', 't'.

Your class should also have a method

     DNASequence  complement();
that constructs a new DNASequence with each base replaced by the base it is paired with. That is, it is the original string with the following substitutions: c->g, g->c, a->t, t->a.
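The substitution itself can be sketched on plain strings as follows. ComplementSketch, pair and complement are illustrative names only, and the sketch assumes its input has already been reduced to the four base letters.

```java
class ComplementSketch {
    // The base that pairs with the given base: c<->g and a<->t.
    static char pair(char base) {
        switch (base) {
            case 'c': return 'g';
            case 'g': return 'c';
            case 'a': return 't';
            case 't': return 'a';
            default:
                throw new IllegalArgumentException("not a base: " + base);
        }
    }

    // Replace every base in the strand by its partner, keeping the order.
    static String complement(String bases) {
        StringBuilder out = new StringBuilder(bases.length());
        for (int i = 0; i < bases.length(); i++) {
            out.append(pair(bases.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(complement("gatc")); // prints ctag
    }
}
```

A useful sanity check for your own method is that taking the complement twice gives back the original sequence.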

Question 2. File Input (20 points)

Write a class DNAExtractor to get a DNA sequence string out of a file.

Notice that the web page with the mitochondrion genome has the DNA data following the word "ORIGIN" and ending with the sequence "//". You can cut and paste the last section of the original web page into a data file, or you can save the original web page to a file. If you save the page, then you will need to delete from the file the string '<a name="sequence_17981852"></a>' that occurs right after 'ORIGIN'. If you want to be fancy, you could ignore all text that occurs inside '<' '>' pairs.

Your class should have a constructor that takes a string for the file-name/path-name. It should also have a method

    String getDNA();
that scans through the file and returns a string consisting of the characters found between "ORIGIN" (in capital letters) and "//".
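The scanning step can be sketched as below. The names OriginExtractSketch, between and fromFile are hypothetical. The sketch assumes the first occurrence of "ORIGIN" in the file is the one that introduces the sequence, assumes the file is UTF-8 text, and does nothing about HTML tags, so it suits a cut-and-paste data file rather than a saved web page.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class OriginExtractSketch {
    // Return the text strictly between the first "ORIGIN" and the next "//".
    // If "ORIGIN" is missing the result is empty; if "//" is missing,
    // everything after "ORIGIN" is returned.
    static String between(String text) {
        int start = text.indexOf("ORIGIN");
        if (start < 0) {
            return "";
        }
        start += "ORIGIN".length();
        int end = text.indexOf("//", start);
        if (end < 0) {
            end = text.length();
        }
        return text.substring(start, end);
    }

    // Read the whole file into memory and extract the sequence section.
    static String fromFile(String path) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        return between(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // A made-up miniature file body for demonstration.
        String sample = "LOCUS x\nORIGIN\n        1 gatc ac\n//\n";
        System.out.print(between(sample));
    }
}
```

Because the search for "//" begins after "ORIGIN", earlier occurrences of "//" in the file (for example inside a URL) do not end the extraction early. The returned text still contains offsets and spacing, which the DNASequence constructor is expected to strip.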

Question 3. Extracting Proteins (20 points)

Write a class ProteinDNA that extends DNASequence with the following constructors and methods:

    ProteinDNA(DNASequence dna, int startAt);
    int    acidLength();       // Number of amino acids.
    char   acidAt(int i);      // The i-th amino acid.
    String acidString();    
The constructor should extract data from the argument DNASequence, starting at the index startAt and stopping when a stop codon is reached.

Make sure that the acidAt and acidLength methods internally count by groups of three.

The character returned by the acidAt method should be the single letter abbreviation for the amino acid from the table above (e.g. A for Alanine). The acidString method should construct a string of these characters, one for each amino acid in the protein.
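To illustrate counting by groups of three, here is a rough sketch that decodes a string of bases codon by codon. It is not the required class design: TranslateSketch, acidFor and translate are made-up names, and instead of a map it packs the codon table above into a 64-character string ordered by first, second and third base in the order t, c, a, g. It also assumes, following the start-codon rule described earlier, that the first codon is always reported as 'M'.

```java
class TranslateSketch {
    // Position of each base within a group of four table entries.
    static final String ORDER = "tcag";

    // Abbreviations for all 64 codons, ttt, ttc, tta, ttg, tct, ... ggg,
    // taken from the vertebrate mitochondrial table; '*' marks a stop codon.
    static final String ACIDS =
          "FFLLSSSSYY**CCWW"   // first base t
        + "LLLLPPPPHHQQRRRR"   // first base c
        + "IIMMTTTTNNKKSS**"   // first base a
        + "VVVVAAAADDEEGGGG";  // first base g

    // Abbreviation for the codon occupying positions pos, pos+1, pos+2.
    static char acidFor(String bases, int pos) {
        int index = 16 * ORDER.indexOf(bases.charAt(pos))
                  + 4 * ORDER.indexOf(bases.charAt(pos + 1))
                  + ORDER.indexOf(bases.charAt(pos + 2));
        return ACIDS.charAt(index);
    }

    // Decode codons from startAt (0-based) until a stop codon or until
    // fewer than three bases remain. The stop codon is not included.
    static String translate(String bases, int startAt) {
        StringBuilder acids = new StringBuilder();
        for (int pos = startAt; pos + 3 <= bases.length(); pos += 3) {
            char acid = acidFor(bases, pos);
            if (acid == '*') {
                break;
            }
            // Assumption: the codon in the start position always means M.
            acids.append(pos == startAt ? 'M' : acid);
        }
        return acids.toString();
    }

    public static void main(String[] args) {
        // Made-up strand: gg | att ccc tga ttt taa | ggg
        System.out.println(translate("ggattccctgattttaaggg", 2)); // prints MPWF
    }
}
```

In the made-up strand, 'att' would normally be I but is reported as M because it is in the start position, and decoding stops at 'taa'.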

Test your program by extracting the genes for proteins from the mitochondrial genome starting at bases 3308 and 4471. (Remember that Java strings are indexed starting at 0 so that you will need to start decoding with baseAt(3307) and baseAt(4470).)

The gene at 3308 is "MT-ND1", for NADH dehydrogenase subunit 1. You should obtain the amino acid sequence
              MPMANLLLLIVPILIAMAFLMLTERKILGYMQLRKGPNVVGPYG
LLQPFADAMKLFTKEPLKPATSTITLYITAPTLALTIALLLWTPLPMPNPLVNLNLGL
LFILATSSLAVYSILWSGWASNSNYALIGALRAVAQTISYEVTLAIILLSTLLMSGSF
NLSTLITTQEHLWLLLPSWPLAMMWFISTLAETNRTPFDLAEGESELVSGFNIEYAAG
PFALFFMAEYTNIIMMNTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFLWIRTA
YPRFRYDQLMHLLWKNFLPLTLALLMWYVSMPITISSIPPQT
The gene at 4471 is "MT-ND2", for NADH dehydrogenase subunit 2. You should obtain the amino acid sequence
              MNPLAQPVIYSTIFAGTLITALSSHWFFTWVGLEMNMLAFIPVL
TKKMNPRSTEAAIKYFLTQATASMILLMAILFNNMLSGQWTMTNTTNQYSSLMIMMAM
AMKLGMAPFHFWVPEVTQGTPLTSGLLLLTWQKLAPISIMYQISPSLNVSLLLTLSIL
SIMAGSWGGLNQTQLRKILAYSSITHMGWMMAVLPYNPNMTILNLTIYIILTTTAFLL
LNLNSSTTTLLLSRTWNKLTWLTPLIPSTLLSLGGLPPLTGFLPKWAIIEEFTKNNSL
IIPTIMATITLLNLYFYLRLIYSTSITLLPMSNNVKMKWQFEHTKPTPFLPTLIALTT
LLLPISPFMLMIL
Note that the start codon, 'att' at position 4471 in the whole sequence, would encode Isoleucine (I) instead of Methionine (M) if it were not in the first position of the gene (as explained above).

Now test your program on the gene MTRNR2 starting at location 1673.

Bonus Question. Variance Analysis (20 points)

Examine the genes listed in the source web page and determine for each, which codons are used. How much does the relative frequency of the codons used vary from gene to gene?

More specifically, for each gene create a table of how often which codons are used to encode the amino acids. For example, in one gene the amino acid Lysine might be encoded as 'aaa' 61% of the time and as 'aag' 39% of the time. In another gene, Lysine might be encoded as 'aaa' 18% of the time and as 'aag' 82%. Do this for all the amino acids across several genes in the mitochondrion. Do you see any patterns?
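One way to get started on such a table is sketched below. CodonUsageSketch, countCodons and share are illustrative names, and the sketch assumes the gene is given as a string of bases that begins at a codon boundary and contains only the coding part.

```java
import java.util.Map;
import java.util.TreeMap;

class CodonUsageSketch {
    // Count how often each codon occurs, reading the gene three bases
    // at a time from position 0. A trailing partial codon is ignored.
    static Map<String, Integer> countCodons(String gene) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int pos = 0; pos + 3 <= gene.length(); pos += 3) {
            counts.merge(gene.substring(pos, pos + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Percentage of the time that 'codon' is used among the listed
    // synonymous codons (the list should include 'codon' itself).
    // Returns 0.0 if none of the synonyms occurs at all.
    static double share(Map<String, Integer> counts, String codon,
                        String... synonyms) {
        int total = 0;
        for (String s : synonyms) {
            total += counts.getOrDefault(s, 0);
        }
        if (total == 0) {
            return 0.0;
        }
        return 100.0 * counts.getOrDefault(codon, 0) / total;
    }

    public static void main(String[] args) {
        // Made-up gene fragment: aaa aag aaa ccc
        Map<String, Integer> counts = countCodons("aaaaagaaaccc");
        System.out.println(counts); // prints {aaa=2, aag=1, ccc=1}
        System.out.println(share(counts, "aaa", "aaa", "aag"));
    }
}
```

In the made-up fragment, Lysine is encoded twice as 'aaa' and once as 'aag', so the share of 'aaa' is two thirds. Repeating this per amino acid and per gene gives the table described above.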