CS 1025 Computer Science Fundamentals I
Assignment 4

Department of Computer Science
University of Western Ontario
Given: November 2, 2011
Due: November 17, 2011

This assignment is out of 60 points.


Introduction

This assignment will analyze the genetic material of human mitochondrial DNA.

DNA and Bases

Human genetic material is carried in each cell in 23 chromosome pairs and in small bodies inside each cell called mitochondria. The genetic material in the chromosome pairs comes from both parents, while the genetic material in the mitochondria is generally thought to come only from the mother. The premise that mitochondrial DNA comes only from the mother is the basis of well-known DNA-based human migration studies.

Natural DNA is formed in oriented strands that can be described as sequences of simple molecules, known as "bases". The bases are cytosine, guanine, adenine and thymine. These are respectively abbreviated 'c', 'g', 'a' and 't'. Analysis of strings representing sequences of these bases is one of the activities in genomics.

DNA in cells has two strands wrapped around each other with the bases of one strand paired with the bases of the other, like a spiral staircase. The base 'g' is always paired with 'c' and 'a' is always paired with 't'.

Base      Abbreviation
Cytosine  c
Guanine   g
Adenine   a
Thymine   t

Proteins and Codons

DNA is used in the manufacture of proteins in cells. For our purposes, all we need to know is that proteins are made up of sequences of smaller molecules known as "amino acids". There are 20 different amino acids that can occur in human proteins. DNA encodes protein "blueprints" as sequences of (cgat) bases. It takes a sequence of three bases to encode an amino acid. For example, 'tgg' encodes tryptophan and 'ttt' encodes phenylalanine. A protein is then specified as a sequence of bases, viewed as a sequence of such triples.

A sequence always starts with a "Start" codon (which would normally encode either Isoleucine or Methionine, but in this special position always encodes Methionine). The sequence ends when a triple encoding "Stop" is encountered.

These triples are known as "codons". Below is a table showing which codons correspond to which amino acids in vertebrate mitochondria.

Amino Acid     Abbrev  Mitochondrial DNA codons
Alanine        A       gct gcc gca gcg
Arginine       R       cgt cgc cga cgg
Asparagine     N       aat aac
Aspartic acid  D       gat gac
Cysteine       C       tgt tgc
Glutamic acid  E       gaa gag
Glutamine      Q       caa cag
Glycine        G       ggt ggc gga ggg
Histidine      H       cat cac
Isoleucine     I       att atc
Leucine        L       ctt ctc cta ctg tta ttg
Lysine         K       aaa aag
Methionine     M       ata atg
Phenylalanine  F       ttt ttc
Proline        P       cct ccc cca ccg
Serine         S       tct tcc tca tcg agt agc
Threonine      T       act acc aca acg
Tryptophan     W       tga tgg
Tyrosine       Y       tat tac
Valine         V       gtt gtc gta gtg
Stop codons    *       aga agg taa tag

Note that there are 4×4×4 = 64 possible triples to represent 20 amino acids, so some amino acids are represented by more than one codon. (Why this is so is an interesting story.) The table also lists four "stop" codons, any one of which signals the end of a codon sequence.
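One way to hold the codon table in a program is a lookup map from codon to one-letter abbreviation. The following is only a minimal sketch, not a required design: the class name CodonTable and the methods acidFor and size are hypothetical names chosen for illustration, and the rows are copied from the table above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of the assignment's required classes):
// maps each codon from the table above to its one-letter abbreviation.
class CodonTable {
    // One row per table entry: the abbreviation followed by its codons.
    private static final String[] ROWS = {
        "A gct gcc gca gcg",
        "R cgt cgc cga cgg",
        "N aat aac",
        "D gat gac",
        "C tgt tgc",
        "E gaa gag",
        "Q caa cag",
        "G ggt ggc gga ggg",
        "H cat cac",
        "I att atc",
        "L ctt ctc cta ctg tta ttg",
        "K aaa aag",
        "M ata atg",
        "F ttt ttc",
        "P cct ccc cca ccg",
        "S tct tcc tca tcg agt agc",
        "T act acc aca acg",
        "W tga tgg",
        "Y tat tac",
        "V gtt gtc gta gtg",
        "* aga agg taa tag"
    };

    private static final Map<String, Character> TABLE = new HashMap<>();

    static {
        for (String row : ROWS) {
            String[] parts = row.split(" ");
            char abbrev = parts[0].charAt(0);
            for (int i = 1; i < parts.length; i++) {
                TABLE.put(parts[i], abbrev);
            }
        }
    }

    // Returns the abbreviation for a codon, or '?' if the codon is not in the table.
    static char acidFor(String codon) {
        Character c = TABLE.get(codon);
        return (c == null) ? '?' : c;
    }

    // Number of distinct codons in the table.
    static int size() {
        return TABLE.size();
    }
}
```

Checking that size() is 64 is a quick way to confirm that no codon was mistyped or entered twice.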

The Human Mitochondrion

The complete sequence of a human mitochondrion is available from the US National Center for Biotechnology Information at the URL http://www.ncbi.nlm.nih.gov/nuccore/251831106?report=genbank. A table of what the different ranges mean is given in the body of the page and the complete genome consisting of 16569 base pairs is given at the end of the page. That data has been placed in the file Mitochondrion.txt.

Instructions

For each of the questions below provide

Question 1. DNA representation (20 points)

Write a class DNASequence to represent a strand of DNA as a sequence of bases.

Your class should have a constructor that takes a String containing a sequence of the letters 'c', 'g', 'a', 't' and other characters. You must construct a new string containing only the letters 'c', 'g', 'a', 't' appearing in the argument string, ignoring all other characters. (We assume that the other characters are there for human readers, e.g. spacing, offset numbers, etc.)
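The filtering step described above could look roughly like the sketch below. The class name BaseFilter and the method keepBases are hypothetical and only isolate the idea; in your solution this logic would live in the DNASequence constructor. The sketch applies the rule literally, so it also skips the letter 'n' mentioned in the final note of this question; deciding how to treat that character is left to you.

```java
// Hypothetical helper illustrating the filtering step of the constructor.
class BaseFilter {
    // Returns a new string holding only the characters c, g, a, t from raw,
    // in their original order; every other character is skipped.
    static String keepBases(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char ch = raw.charAt(i);
            if (ch == 'c' || ch == 'g' || ch == 'a' || ch == 't') {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

For example, keepBases("1 gatcacaggt ctatcaccct") returns "gatcacaggtctatcaccct".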

Your class should also have methods

     public int    baseLength();     // number of bases in the strand.
     public char   baseAt(int i);    // the base at position i, i = 0..baseLength()-1.
     public String baseString();     // cgat string.
that can be used to access the individual bases. The character returned by the baseAt method should be one of 'c', 'g', 'a', 't'.

Your class should also have a method

     public DNASequence  complement();
that constructs a new DNASequence with each base replaced by the base it is paired with. That is, it is the original string with the following substitutions: c->g, g->c, a->t, t->a.
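The substitution itself can be sketched on plain strings as shown below. The names ComplementSketch, pairOf and complementOf are hypothetical; your complement() method would wrap the result in a new DNASequence. Throwing an exception for a character that is not a base is an assumption of this sketch, not a requirement stated above.

```java
// Hypothetical helper illustrating the base-pair substitution.
class ComplementSketch {
    // Returns the base that pairs with the given base.
    static char pairOf(char base) {
        switch (base) {
            case 'c': return 'g';
            case 'g': return 'c';
            case 'a': return 't';
            case 't': return 'a';
            default:
                // Assumption: anything else is treated as invalid input.
                throw new IllegalArgumentException("not a base: " + base);
        }
    }

    // Returns a string of the same length with every base replaced by its pair.
    static String complementOf(String bases) {
        StringBuilder sb = new StringBuilder(bases.length());
        for (int i = 0; i < bases.length(); i++) {
            sb.append(pairOf(bases.charAt(i)));
        }
        return sb.toString();
    }
}
```

Taking the complement twice should give back the original string, which makes a convenient test.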

Final note: Your program should check for valid input. The "Mitochondrion.txt" file contains a letter "n" near position 3100. Biologists use this as a short-hand to mean "any of c, g, a, t".

Question 2. File Input (20 points)

Write a class DNAExtractor to get a DNA sequence string out of a file.

Notice that the web page with the mitochondrion genome has the DNA data following the word "ORIGIN" and ending with the sequence "//". You can cut and paste the last section of the original web page into a data file, or you can use the file Mitochondrion.txt.

Your class should have a constructor that takes a string for the file-name/path-name, i.e.

    public DNAExtractor(String filename) { ... }
It should also have a method
    public String getDNA() { ... }
that scans through the file and returns a string consisting of the characters found between "ORIGIN" (in capital letters) and "//". An example of how to read from files is given here.
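As a rough illustration of the extraction step, the sketch below reads the whole file into memory and cuts out the text between the two markers. The names ExtractSketch, between and dnaFromFile are hypothetical, reading the file in one piece with java.nio.file.Files is just one possible approach, and returning an empty string when a marker is missing is an assumption of this sketch. The search for "//" starts only after "ORIGIN", so that an earlier "//" (for example inside a URL) is not mistaken for the end marker.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper illustrating the search for the two markers.
class ExtractSketch {
    // Returns the text strictly between the first occurrence of startMark and
    // the first occurrence of endMark after it, or "" if either is missing.
    static String between(String text, String startMark, String endMark) {
        int start = text.indexOf(startMark);
        if (start < 0) {
            return "";
        }
        start += startMark.length();
        int end = text.indexOf(endMark, start);
        if (end < 0) {
            return "";
        }
        return text.substring(start, end);
    }

    // Reads the whole file into memory, then extracts the marked region.
    static String dnaFromFile(String filename) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filename));
        String text = new String(bytes, StandardCharsets.UTF_8);
        return between(text, "ORIGIN", "//");
    }
}
```

The returned string still contains offset numbers, spaces and line breaks; these are removed later by the DNASequence constructor.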

Question 3. Extracting Proteins (20 points)

Write a class ProteinDNA that extends DNASequence with the following constructors and methods:

    public ProteinDNA(DNASequence dna, int startAt) {...}
    public int    acidLength()  {...}      // Number of amino acids.
    public char   acidAt(int i) {...}      // The i-th amino acid.
    public String acidString()  {...}    
The constructor should extract data from the argument DNASequence, starting at the index startAt and stopping when a stop codon is reached.

Make sure that the acidAt and acidLength methods internally count by groups of three.

The character returned by the acidAt method should be the single letter abbreviation for the amino acid from the table above (e.g. A for Alanine). The acidString method should construct a string of these characters, one for each amino acid in the protein.

Test your program by extracting the genes for proteins from the mitochondrial genome starting at bases 3307 and 4470. (Remember that Java strings are indexed starting at 0 so that you will need to start decoding with baseAt(3306) and baseAt(4469).)
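The index arithmetic behind counting by groups of three and the 0-based start position can be sketched as follows. The names CodonIndexSketch, toIndex and codonAt are hypothetical and work on a plain string of bases rather than on your DNASequence class.

```java
// Hypothetical helper illustrating the index arithmetic only.
class CodonIndexSketch {
    // Converts a 1-based position, as used in the genome listing,
    // to a 0-based Java string index.
    static int toIndex(int position) {
        return position - 1;
    }

    // Returns the i-th codon (i = 0, 1, 2, ...) of a gene whose first base
    // sits at 0-based index startIndex in bases.
    static String codonAt(String bases, int startIndex, int i) {
        int from = startIndex + 3 * i;
        return bases.substring(from, from + 3);
    }
}
```

With the string "ggatgcccatataa" and startIndex 2, the codons at i = 0, 1, 2, 3 are "atg", "ccc", "ata" and "taa".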

The gene at 3307 is "MTND1", for NADH dehydrogenase subunit 1. You should obtain the amino acid sequence
              MPMANLLLLIVPILIAMAFLMLTERKILGYMQLRKGPNVVGPYG
LLQPFADAMKLFTKEPLKPATSTITLYITAPTLALTIALLLWTPLPMPNPLVNLNLGL
LFILATSSLAVYSILWSGWASNSNYALIGALRAVAQTISYEVTLAIILLSTLLMSGSF
NLSTLITTQEHLWLLLPSWPLAMMWFISTLAETNRTPFDLAEGESELVSGFNIEYAAG
PFALFFMAEYTNIIMMNTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFLWIRTA
YPRFRYDQLMHLLWKNFLPLTLALLMWYVSMPITISSIPPQT

The gene at 4470 is "MTND2", for NADH dehydrogenase subunit 2. You should obtain the amino acid sequence
              MNPLAQPVIYSTIFAGTLITALSSHWFFTWVGLEMNMLAFIPVL
TKKMNPRSTEAAIKYFLTQATASMILLMAILFNNMLSGQWTMTNTTNQYSSLMIMMAM
AMKLGMAPFHFWVPEVTQGTPLTSGLLLLTWQKLAPISIMYQISPSLNVSLLLTLSIL
SIMAGSWGGLNQTQLRKILAYSSITHMGWMMAVLPYNPNMTILNLTIYIILTTTAFLL
LNLNSSTTTLLLSRTWNKLTWLTPLIPSTLLSLGGLPPLTGFLPKWAIIEEFTKNNSL
IIPTIMATITLLNLYFYLRLIYSTSITLLPMSNNVKMKWQFEHTKPTPFLPTLIALTT
LLLPISPFMLMIL
Note that the start codon, 'att' at position 4470 in the whole sequence, would encode Isoleucine (I) instead of Methionine (M) if it were not in the first position of the gene (as explained above).

Now test your program on the gene MTRNR2 starting at location 1671.

Question 4. Complementary Genes (20 points) (Bonus)

Write a program to transcribe complementary genes, for example the gene "MT-TE" at location 14674.

Question 5. Counting Codons (20 points) (Bonus)

Examine the genes listed in the source web page and count how many times each of the 64 different codons occurs in each of them. Display the results for each gene and as a grand total. You should give your program an array containing the starting positions of all the genes (as found on the web page) and it should do the rest by itself.
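Counting the codons of a single gene might be sketched as below. The names CodonCountSketch, countCodons and isStop are hypothetical, and two choices are assumptions of this sketch rather than requirements: the stop codon is included in the counts, and codons that never occur are simply absent from the map instead of being listed with a count of zero.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper illustrating codon counting for one gene.
class CodonCountSketch {
    // Counts the codons in bases, reading triples from startIndex until a
    // stop codon (included in the counts) or the end of the string.
    static Map<String, Integer> countCodons(String bases, int startIndex) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = startIndex; i + 3 <= bases.length(); i += 3) {
            String codon = bases.substring(i, i + 3);
            counts.merge(codon, 1, Integer::sum);
            if (isStop(codon)) {
                break;
            }
        }
        return counts;
    }

    // The four stop codons from the table of mitochondrial codons.
    static boolean isStop(String codon) {
        return codon.equals("aga") || codon.equals("agg")
            || codon.equals("taa") || codon.equals("tag");
    }
}
```

A grand total can then be built by merging the per-gene maps in the same way, one gene after another.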

Question 6. Most Common Codons (20 points) (Bonus)

Write a program that finds, for each amino acid, which codon most often encodes it. As with Bonus Question 5, you should give your program an array containing the starting positions of all the genes (as found on the web page) and it should do the rest by itself.