CS 1025 Computer Science Fundamentals I
Assignment 5

Department of Computer Science
University of Western Ontario
Given: November 11, 2007
Due: December 2, 2007

This assignment is out of 60 points.

Paper copies of your work must be handed in to the CS 025 locker (using the assignment submission form) and electronic copies should be submitted via the usual Computer Science Department electronic assignment submission system.

Please see the course outline for information on late penalties and the rules of ethical conduct.

Introduction

This assignment will analyze the genetic material of human mitochondrial DNA.

DNA and Bases

Human genetic material is carried in each cell in 23 chromosome pairs and in small bodies inside each cell called mitochondria. The genetic material in the chromosome pairs comes from both parents, while the genetic material in the mitochondria is generally thought to come only from the mother. The premise that mitochondrial DNA comes only from the mother is the basis of well-known DNA-based human migration studies.

Natural DNA is formed in oriented strands that can be described as sequences of simple molecules, known as "bases". The bases are cytosine, guanine, adenine and thymine. These are respectively abbreviated 'c', 'g', 'a' and 't'. Analysis of strings representing sequences of these bases is one of the activities in genomics.

DNA in cells has two strands wrapped around each other with the bases of one strand paired with the bases of the other, like a spiral staircase. The base 'g' is always paired with 'c' and 'a' is always paired with 't'.

Base	Abbreviation
Cytosine	c
Guanine	g
Adenine	a
Thymine	t

Proteins and Codons

DNA is used in the manufacture of proteins in cells. For our purposes, all we need to know is that proteins are made up of sequences of smaller molecules known as "amino acids". There are 20 different amino acids that can occur in human proteins. DNA encodes protein "blueprints" as sequences of (cgat) bases. It takes a sequence of three bases to encode an amino acid. For example, 'tgg' encodes tryptophan and 'ttt' encodes phenylalanine. A protein is then specified as a sequence of bases, viewed as a sequence of such triples.

A sequence always starts with a "Start" codon (which would normally encode either Isoleucine or Methionine, but in this special position always encodes Methionine). The sequence ends when a triple encoding "Stop" is encountered.

These triples are known as "codons". Below is a table showing which codons correspond to which amino acids in vertebrate mitochondria.

Amino Acid	Abbrev	Mitochondrial DNA codons
Alanine	A	`gct gcc gca gcg`
Arginine	R	`cgt cgc cga cgg`
Asparagine	N	`aat aac`
Aspartic acid	D	`gat gac`
Cysteine	C	`tgt tgc`
Glutamic acid	E	`gaa gag`
Glutamine	Q	`caa cag`
Glycine	G	`ggt ggc gga ggg`
Histidine	H	`cat cac`
Isoleucine	I	`att atc`
Leucine	L	`ctt ctc cta ctg tta ttg`
Lysine	K	`aaa aag`
Methionine	M	`ata atg`
Phenylalanine	F	`ttt ttc`
Proline	P	`cct ccc cca ccg`
Serine	S	`tct tcc tca tcg agt agc`
Threonine	T	`act acc aca acg`
Tryptophan	W	`tga tgg`
Tyrosine	Y	`tat tac`
Valine	V	`gtt gtc gta gtg`
Stop codons	*	`aga agg taa tag`

Note that there are 4×4×4 = 64 possible triples to represent 20 amino acids, so some amino acids are represented by more than one codon. (Why this is so is an interesting story.) There are also four "stop" codons, any one of which signals the end of a codon sequence.
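As a quick check on the table above, the following sketch loads its rows into a lookup map and confirms that the 64 possible triples are all accounted for. The class name CodonTable and the methods lookup and size are hypothetical names chosen for this illustration; they are not part of the required interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a codon -> one-letter abbreviation map built from the table above.
class CodonTable {
    // Each row: the abbreviation followed by its codons, copied from the table.
    private static final String[] ROWS = {
        "A gct gcc gca gcg", "R cgt cgc cga cgg", "N aat aac", "D gat gac",
        "C tgt tgc", "E gaa gag", "Q caa cag", "G ggt ggc gga ggg",
        "H cat cac", "I att atc", "L ctt ctc cta ctg tta ttg", "K aaa aag",
        "M ata atg", "F ttt ttc", "P cct ccc cca ccg",
        "S tct tcc tca tcg agt agc", "T act acc aca acg", "W tga tgg",
        "Y tat tac", "V gtt gtc gta gtg", "* aga agg taa tag"
    };

    private static final Map<String, Character> TABLE = new HashMap<>();

    static {
        for (String row : ROWS) {
            String[] parts = row.split(" ");
            for (int i = 1; i < parts.length; i++) {
                TABLE.put(parts[i], parts[0].charAt(0));
            }
        }
    }

    // Returns the abbreviation for a codon, or '*' for a stop codon.
    static char lookup(String codon) {
        Character acid = TABLE.get(codon);
        if (acid == null) {
            throw new IllegalArgumentException("not a codon: " + codon);
        }
        return acid;
    }

    // Number of distinct codons in the table.
    static int size() {
        return TABLE.size();
    }

    public static void main(String[] args) {
        System.out.println(size() + " codons; tgg encodes " + lookup("tgg"));
    }
}
```

A map keyed by the three-letter string is only one possible representation; an array indexed by a number computed from the three bases would work equally well.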

The Human Mitochondrion

The complete sequence of a human mitochondrion is available from the US National Center for Biotechnology Information at the URL http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?list_uids=17981852. (In case this site becomes inaccessible, a backup link is given at this link.) A table of what the different ranges mean is given in the body of the page and the complete genome consisting of 16571 base pairs is given at the end of the page.

Instructions

For each of the questions below, provide:

an implementation
documentation of each class and each method
test cases (using a driver program and constructors)
inputs and outputs for each program run

Question 1. DNA representation (20 points)

Write a class DNASequence to represent a strand of DNA as a sequence of bases.

Your class should have a constructor that takes a String containing a sequence of the letters 'c', 'g', 'a', 't' and other characters. You must construct a new string containing only the letters 'c', 'g', 'a', 't' appearing in the argument string and ignoring all other characters. (We assume that these will be for human readers, e.g. spacing, offset numbers, etc.)
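The filtering step described above could be sketched as follows. This is only an illustration under stated assumptions: the class name BaseFilter and the method keepBases are hypothetical, and the sketch assumes that only lower-case letters count as bases, so upper-case letters are dropped along with everything else.

```java
// Hypothetical helper illustrating the filtering a DNASequence constructor needs.
class BaseFilter {
    // Returns a new string holding only the characters 'c', 'g', 'a', 't' from raw,
    // in their original order. Assumption: upper-case letters are not bases.
    static String keepBases(String raw) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char ch = raw.charAt(i);
            if (ch == 'c' || ch == 'g' || ch == 'a' || ch == 't') {
                kept.append(ch);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Offset numbers and spacing, as they appear in the genome listing, are dropped.
        System.out.println(keepBases("        1 gatcacaggt ctatcaccct"));
    }
}
```

A StringBuilder is used here because repeatedly concatenating to a String inside a loop copies the whole string each time, which becomes slow for a genome of several thousand bases.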

Your class should also have methods

     int    baseLength();     // number of bases in the strand.
     char   baseAt(int i);    // the base at position i, i = 0..baseLength()-1.
     String baseString();     // cgat string.

that can be used to access the individual bases. The character returned by the baseAt method should be one of 'c', 'g', 'a', 't'.

Your class should also have a method

     DNASequence  complement();

that constructs a new DNASequence with each base replaced by the base it is paired with. That is, it is the original string with the following substitutions: c->g, g->c, a->t, t->a.
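The substitution rule above can be illustrated on plain strings. The class name Complement and its methods are hypothetical names used for this sketch only; in your own solution the logic would live inside DNASequence and return a new DNASequence.

```java
// Hypothetical helper showing the base-pairing substitution on a plain string.
class Complement {
    // The base that pairs with the given base: c<->g and a<->t.
    static char pair(char base) {
        switch (base) {
            case 'c': return 'g';
            case 'g': return 'c';
            case 'a': return 't';
            case 't': return 'a';
            default:
                throw new IllegalArgumentException("not a base: " + base);
        }
    }

    // Applies pair() to every base, keeping the order of the strand.
    static String complement(String bases) {
        StringBuilder result = new StringBuilder(bases.length());
        for (int i = 0; i < bases.length(); i++) {
            result.append(pair(bases.charAt(i)));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(complement("cgat")); // prints gcta
    }
}
```

One useful test case follows directly from the pairing rule: taking the complement twice must give back the original strand.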

Question 2. File Input (20 points)

Write a class DNAExtractor to get a DNA sequence string out of a file.

Notice that the web page with the mitochondrion genome has the DNA data following the word "ORIGIN" and ending with the sequence "//". You can cut and paste the last section of the original web page into a data file, or you can save the original web page to a file. If you save the page, then you will need to delete from the file the string '<a name="sequence_17981852"></a>' that occurs right after 'ORIGIN'. If you want to be fancy, you could ignore all text that occurs inside '<' '>' pairs.
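For the "fancy" option, one possible approach is a single pass that remembers whether the current character lies inside a '<' '>' pair. The class name TagStripper and the method stripTags are hypothetical; the sketch assumes tags are never nested and makes no attempt at full HTML parsing.

```java
// Hypothetical helper: drops everything between '<' and the next '>'.
class TagStripper {
    static String stripTags(String text) {
        StringBuilder kept = new StringBuilder();
        boolean inTag = false; // true while we are between '<' and '>'
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '<') {
                inTag = true;
            } else if (ch == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                kept.append(ch);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String saved = "ORIGIN<a name=\"sequence_17981852\"></a>\n 1 gatc";
        System.out.println(stripTags(saved));
    }
}
```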

Your class should have a constructor that takes a string for the file-name/path-name. It should also have a method

    String getDNA();

that scans through the file and returns a string consisting of the characters found between "ORIGIN" (in capital letters) and "//".
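A minimal sketch of the extraction step is given below, assuming the whole file fits comfortably in memory. The names OriginExtractor, readFile and between are hypothetical, and what to do when a marker is missing is a design decision left to you; here a missing "ORIGIN" gives an empty string and a missing "//" means "up to the end of the text".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper showing one way to find the text between the two markers.
class OriginExtractor {
    // Reads the entire file into one string (assumes a small, UTF-8 compatible file).
    static String readFile(String path) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }

    // Returns the characters after the first "ORIGIN" and before the next "//".
    static String between(String text) {
        int start = text.indexOf("ORIGIN");
        if (start < 0) {
            return ""; // assumption: no marker means no DNA data
        }
        start += "ORIGIN".length();
        // Search for "//" only after ORIGIN, so earlier text such as "http://" is skipped.
        int end = text.indexOf("//", start);
        if (end < 0) {
            end = text.length();
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(between("LOCUS x\nORIGIN\n 1 gatc\n//\n"));
    }
}
```

The returned string still contains offset numbers and whitespace; these are meant to be discarded by the DNASequence constructor from Question 1.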

Question 3. Extracting Proteins (20 points)

Write a class ProteinDNA that extends DNASequence with the following constructors and methods:

    ProteinDNA(DNASequence dna, int startAt);
    int    acidLength();       // Number of amino acids.
    char   acidAt(int i);      // The i-th amino acid.
    String acidString();

The constructor should extract data from the argument DNASequence starting at the given index and stopping when a stop codon is reached.

Make sure that the acidAt and acidLength methods internally count by groups of three.

The character returned by the acidAt method should be the single letter abbreviation for the amino acid from the table above (e.g. A for Alanine). The acidString method should construct a string of these characters, one for each amino acid in the protein.
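The decoding loop can be sketched on a plain string of bases as follows. The class Translator and its methods are hypothetical names, not the required ProteinDNA interface. To keep the example short, the codon table above is packed into a 64-character string indexed by the three bases in the order t, c, a, g; a map from codon strings to letters would serve just as well.

```java
// Hypothetical helper: decodes bases three at a time until a stop codon.
class Translator {
    private static final String BASES = "tcag";
    // Abbreviations for all 64 codons, ttt first and ggg last, taken from the table above.
    private static final String ACIDS =
          "FFLLSSSSYY**CCWW"   // first base t
        + "LLLLPPPPHHQQRRRR"   // first base c
        + "IIMMTTTTNNKKSS**"   // first base a
        + "VVVVAAAADDEEGGGG";  // first base g

    // The abbreviation for the codon that starts at 0-based position pos.
    static char acidFor(String bases, int pos) {
        int index = 16 * BASES.indexOf(bases.charAt(pos))
                  + 4 * BASES.indexOf(bases.charAt(pos + 1))
                  + BASES.indexOf(bases.charAt(pos + 2));
        return ACIDS.charAt(index);
    }

    // Decodes from startAt (0-based) in steps of three until a stop codon
    // or until fewer than three bases remain.
    static String translate(String bases, int startAt) {
        StringBuilder acids = new StringBuilder();
        for (int pos = startAt; pos + 3 <= bases.length(); pos += 3) {
            char acid = acidFor(bases, pos);
            if (acid == '*') {
                break;
            }
            // The codon in the start position always encodes Methionine.
            acids.append(pos == startAt ? 'M' : acid);
        }
        return acids.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("attccctaa", 0)); // prints MP
    }
}
```

In the example, 'att' would normally encode Isoleucine but is reported as M because it sits in the start position, 'ccc' encodes Proline, and 'taa' ends the sequence.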

Test your program by extracting the genes for proteins from the mitochondrial genome starting at bases 3308 and 4471. (Remember that Java strings are indexed starting at 0 so that you will need to start decoding with baseAt(3307) and baseAt(4470).)

The gene at 3308 is "MT-ND1", for NADH dehydrogenase subunit 1. You should obtain the amino acid sequence

              MPMANLLLLIVPILIAMAFLMLTERKILGYMQLRKGPNVVGPYG
LLQPFADAMKLFTKEPLKPATSTITLYITAPTLALTIALLLWTPLPMPNPLVNLNLGL
LFILATSSLAVYSILWSGWASNSNYALIGALRAVAQTISYEVTLAIILLSTLLMSGSF
NLSTLITTQEHLWLLLPSWPLAMMWFISTLAETNRTPFDLAEGESELVSGFNIEYAAG
PFALFFMAEYTNIIMMNTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFLWIRTA
YPRFRYDQLMHLLWKNFLPLTLALLMWYVSMPITISSIPPQT

The gene at 4471 is "MT-ND2", for NADH dehydrogenase subunit 2. You should obtain the amino acid sequence

              MNPLAQPVIYSTIFAGTLITALSSHWFFTWVGLEMNMLAFIPVL
TKKMNPRSTEAAIKYFLTQATASMILLMAILFNNMLSGQWTMTNTTNQYSSLMIMMAM
AMKLGMAPFHFWVPEVTQGTPLTSGLLLLTWQKLAPISIMYQISPSLNVSLLLTLSIL
SIMAGSWGGLNQTQLRKILAYSSITHMGWMMAVLPYNPNMTILNLTIYIILTTTAFLL
LNLNSSTTTLLLSRTWNKLTWLTPLIPSTLLSLGGLPPLTGFLPKWAIIEEFTKNNSL
IIPTIMATITLLNLYFYLRLIYSTSITLLPMSNNVKMKWQFEHTKPTPFLPTLIALTT
LLLPISPFMLMIL

Note that the start codon, 'att' at position 4471 in the whole sequence, would encode Isoleucine (I) instead of Methionine (M) if it were not in the first position of the gene (as explained above).

Now test your program on the gene MTRNR2 starting at location 1673.

Bonus Question. Variance Analysis (20 points)

Examine the genes listed in the source web page and determine for each, which codons are used. How much does the relative frequency of the codons used vary from gene to gene?

More specifically, for each gene create a table of how often which codons are used to encode the amino acids. For example, in one gene the amino acid Lysine might be encoded as 'aaa' 61% of the time and as 'aag' 39% of the time. In another gene, Lysine might be encoded as 'aaa' 18% of the time and as 'aag' 82%. Do this for all the amino acids across several genes in the mitochondrion. Do you see any patterns?
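One possible starting point for the tables is to count each codon in a gene and then express a codon's count as a percentage of all codons for the same amino acid. The class CodonUsage and its methods are hypothetical names, and the sketch assumes the gene string already starts at the first base of a codon and contains only bases.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for tallying codon usage in one gene.
class CodonUsage {
    // Counts how often each codon occurs, reading the gene three bases at a time.
    static Map<String, Integer> countCodons(String gene) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int pos = 0; pos + 3 <= gene.length(); pos += 3) {
            counts.merge(gene.substring(pos, pos + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Percentage of the time that codon is used, among the listed codons
    // for the same amino acid. Returns 0 if none of them occur.
    static double share(Map<String, Integer> counts, String codon, String... synonyms) {
        int total = 0;
        for (String s : synonyms) {
            total += counts.getOrDefault(s, 0);
        }
        if (total == 0) {
            return 0.0;
        }
        return 100.0 * counts.getOrDefault(codon, 0) / total;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countCodons("aaaaagaaaaaa"); // aaa aag aaa aaa
        System.out.println("aaa: " + share(counts, "aaa", "aaa", "aag") + "%");
    }
}
```

In the made-up example string, Lysine appears four times, three times as 'aaa' and once as 'aag', so 'aaa' accounts for 75% of the Lysine codons.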
