This assignment is out of 60 points.
This assignment will analyze the genetic material of human mitochondrial DNA.
Human genetic material is carried in each cell in 23 chromosome pairs and in small bodies inside each cell called mitochondria. The genetic material in the chromosome pairs comes from both parents, while the genetic material in the mitochondria is generally thought to come only from the mother. The premise that mitochondrial DNA comes only from the mother is the basis of well-known DNA-based human migration studies.
Natural DNA is formed in oriented strands that can be described as sequences of simple molecules, known as "bases". The bases are cytosine, guanine, adenine and thymine. These are respectively abbreviated 'c', 'g', 'a' and 't'. Analysis of strings representing sequences of these bases is one of the activities in genomics.
DNA in cells has two strands wrapped around each other with the bases of one strand paired with the bases of the other, like a spiral staircase. The base 'g' is always paired with 'c' and 'a' is always paired with 't'.
Base | Abbreviation |
---|---|
Cytosine | c |
Guanine | g |
Adenine | a |
Thymine | t |
DNA is used in the manufacture of proteins in cells. For our purposes, all we need to know is that proteins are made up of sequences of smaller molecules known as "amino acids". There are 20 different amino acids that can occur in human proteins. DNA encodes protein "blueprints" as sequences of (cgat) bases. It takes a sequence of three bases to encode an amino acid. For example, 'tgg' encodes tryptophan and 'ttt' encodes phenylalanine. A protein is then specified as a sequence of bases, viewed as a sequence of such triples.
A sequence always starts with a "Start" codon, which would normally encode either Isoleucine or Methionine but in this special position always encodes Methionine. The sequence ends when a triple encoding "Stop" is encountered.
These triples are known as "codons". Below is a table showing which codons correspond to which amino acids in vertebrate mitochondria.
Amino Acid | Abbrev | Mitochondrial DNA codons |
---|---|---|
Alanine | A | gct gcc gca gcg |
Arginine | R | cgt cgc cga cgg |
Asparagine | N | aat aac |
Aspartic acid | D | gat gac |
Cysteine | C | tgt tgc |
Glutamic acid | E | gaa gag |
Glutamine | Q | caa cag |
Glycine | G | ggt ggc gga ggg |
Histidine | H | cat cac |
Isoleucine | I | att atc |
Leucine | L | ctt ctc cta ctg tta ttg |
Lysine | K | aaa aag |
Methionine | M | ata atg |
Phenylalanine | F | ttt ttc |
Proline | P | cct ccc cca ccg |
Serine | S | tct tcc tca tcg agt agc |
Threonine | T | act acc aca acg |
Tryptophan | W | tga tgg |
Tyrosine | Y | tat tac |
Valine | V | gtt gtc gta gtg |
Stop codons | * | aga agg taa tag |
Note that there are 4×4×4 = 64 possible triples to represent 20 amino acids, so some amino acids are represented by more than one codon. (Why this is so is an interesting story.) There are also four "stop" codons, as listed in the last row of the table, any one of which signals the end of a codon sequence.
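The count of possible triples can be checked by enumerating them. The following is a minimal sketch; the class name `CodonEnumerator` and the method `allCodons` are hypothetical names chosen for illustration and are not part of the classes the assignment asks for.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not one of the required classes.
final class CodonEnumerator {
    private static final char[] BASES = {'c', 'g', 'a', 't'};

    /** Returns every three-letter string over the alphabet c, g, a, t. */
    static List<String> allCodons() {
        List<String> codons = new ArrayList<>();
        for (char first : BASES) {
            for (char second : BASES) {
                for (char third : BASES) {
                    // "" + char forces string concatenation rather than char arithmetic.
                    codons.add("" + first + second + third);
                }
            }
        }
        return codons;
    }

    public static void main(String[] args) {
        List<String> codons = allCodons();
        System.out.println(codons.size() + " codons, first: " + codons.get(0));
    }
}
```

Summing the number of codons in each row of the table above should give the same total, which is a quick way to check that no codon was missed or listed twice.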
The complete sequence of a human mitochondrion is available from the US National Center for Biotechnology Information at the URL http://www.ncbi.nlm.nih.gov/nuccore/251831106?report=genbank. A table of what the different ranges mean is given in the body of the page and the complete genome consisting of 16569 base pairs is given at the end of the page. That data has been placed in the file Mitochondrion.txt.
For each of the questions below provide
Write a class DNASequence to represent a strand of DNA as a sequence of bases.
Your class should have a constructor that takes a String containing a sequence of the letters 'c', 'g', 'a', 't' and other characters. You must construct a new string containing only the letters 'c', 'g', 'a', 't' appearing in the argument string, ignoring all other characters. (We assume that the other characters are there for human readers, e.g. spacing, offset numbers, etc.)
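One way to do the filtering step is a single pass with a `StringBuilder`. This is only a sketch: `BaseFilter` and `keepBases` are hypothetical names, and the sketch assumes lower-case input and silently drops everything else. Note that dropping a character shifts the index of every base after it, so how to treat unexpected letters is a decision your own class still has to make.

```java
// Hypothetical helper for illustration; the assignment's DNASequence
// constructor could do something similar internally.
final class BaseFilter {
    /** Keeps only the characters c, g, a, t, in their original order. */
    static String keepBases(String raw) {
        StringBuilder bases = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char ch = raw.charAt(i);
            if (ch == 'c' || ch == 'g' || ch == 'a' || ch == 't') {
                bases.append(ch);
            }
            // Assumption: every other character (digits, spaces, newlines,
            // other letters) is ignored without any warning.
        }
        return bases.toString();
    }

    public static void main(String[] args) {
        System.out.println(keepBases("        1 gatcacaggt ctatcaccct\n"));
    }
}
```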
Your class should also have methods
    public int baseLength();      // number of bases in the strand.
    public char baseAt(int i);    // the base at position i, i = 0..baseLength()-1.
    public String baseString();   // cgat string.

that can be used to access the individual bases. The character returned by the baseAt method should be one of 'c', 'g', 'a', 't'.
Your class should also have a method
    public DNASequence complement();

that constructs a new DNASequence with each base replaced by the base it is paired with. That is, it is the original string with the following substitutions: c->g, g->c, a->t, t->a.
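The substitution itself can be sketched on plain strings as follows. `ComplementSketch` and its methods are hypothetical names; the sketch assumes its input has already been reduced to the four base letters and rejects anything else.

```java
// Hypothetical helper for illustration; a DNASequence.complement() method
// could apply the same mapping to its own base string.
final class ComplementSketch {
    /** Returns the base paired with the given base: c<->g, a<->t. */
    static char complementOf(char base) {
        switch (base) {
            case 'c': return 'g';
            case 'g': return 'c';
            case 'a': return 't';
            case 't': return 'a';
            default:
                throw new IllegalArgumentException("not a base: " + base);
        }
    }

    /** Replaces every base by its partner, keeping the original order. */
    static String complement(String bases) {
        StringBuilder result = new StringBuilder(bases.length());
        for (int i = 0; i < bases.length(); i++) {
            result.append(complementOf(bases.charAt(i)));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(complement("cgat")); // prints "gcta"
    }
}
```

Applying the substitution twice gives back the original string, which makes a convenient self-check.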
Final note: Your program should check for valid input. The "Mitochondrion.txt" file contains a letter "n" near position 3100. Biologists use this as a short-hand to mean "any of c, g, a, t".
Write a class DNAExtractor to get a DNA sequence string out of a file.
Notice that the web page with the mitochondrion genome has the DNA data following the word "ORIGIN" and ending with the sequence "//". You can cut and paste the last section of the original web page into a data file, or you can use the file Mitochondrion.txt.
Your class should have a constructor that takes a string for the file-name/path-name, i.e.
    public DNAExtractor(String filename) { ... }

It should also have a method

    public String getDNA() { ... }

that scans through the file and returns a string consisting of the characters found between "ORIGIN" (in capital letters) and "//". An example of how to read from files is given here.
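As a rough sketch of the scanning step, the text between the two markers can be located with `String.indexOf`. The names `OriginSection`, `between` and `fromFile` are hypothetical, and the sketch assumes that the first occurrence of "ORIGIN" in the file is the marker you want and that the file is small enough to read in one go.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; DNAExtractor.getDNA() could
// delegate to something like this.
final class OriginSection {
    /** Returns the characters after the first "ORIGIN" and before the next "//". */
    static String between(String text) {
        int start = text.indexOf("ORIGIN");
        if (start < 0) {
            throw new IllegalArgumentException("no ORIGIN marker found");
        }
        start += "ORIGIN".length();
        // Search for "//" only after ORIGIN, so earlier slashes (e.g. in URLs) are skipped.
        int end = text.indexOf("//", start);
        if (end < 0) {
            throw new IllegalArgumentException("no // terminator found");
        }
        return text.substring(start, end);
    }

    /** Reads the whole file into memory and extracts the ORIGIN section. */
    static String fromFile(String filename) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filename));
        return between(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(between("LOCUS demo\nORIGIN\n 1 gatc\n//\n"));
    }
}
```

The returned string still contains offset numbers and whitespace; those are removed later by the DNASequence constructor.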
Write a class ProteinDNA that extends DNASequence with the following constructors and methods:
    public ProteinDNA(DNASequence dna, int startAt) {...}
    public int acidLength() {...}      // Number of amino acids.
    public char acidAt(int i) {...}    // The i-th amino acid.
    public String acidString() {...}

The constructor should extract data from the argument DNASequence, starting at the index startAt and stopping when a stop codon is reached.
Make sure that the acidAt and acidLength methods internally count by groups of three.
The character returned by the acidAt method should be the single letter abbreviation for the amino acid from the table above (e.g. A for Alanine). The acidString method should construct a string of these characters, one for each amino acid in the protein.
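The codon lookup can be sketched with a map built from the table above. Everything here is an illustrative assumption rather than the required design: the class `CodonTranslator`, its method names, the choice to work on a plain string with a 0-based start index, and the decisions to report the first codon as 'M' without checking it and to stop quietly if the data runs out before a stop codon.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; ProteinDNA could wrap similar logic.
final class CodonTranslator {
    // One row per table entry: single-letter abbreviation, then its codons.
    private static final String[] ROWS = {
        "A gct gcc gca gcg", "R cgt cgc cga cgg", "N aat aac", "D gat gac",
        "C tgt tgc", "E gaa gag", "Q caa cag", "G ggt ggc gga ggg",
        "H cat cac", "I att atc", "L ctt ctc cta ctg tta ttg", "K aaa aag",
        "M ata atg", "F ttt ttc", "P cct ccc cca ccg",
        "S tct tcc tca tcg agt agc", "T act acc aca acg", "W tga tgg",
        "Y tat tac", "V gtt gtc gta gtg", "* aga agg taa tag",
    };

    private static final Map<String, Character> ACID_FOR_CODON = new HashMap<>();
    static {
        for (String row : ROWS) {
            String[] parts = row.split(" ");
            for (int i = 1; i < parts.length; i++) {
                ACID_FOR_CODON.put(parts[i], parts[0].charAt(0));
            }
        }
    }

    /** Looks up the abbreviation for one codon; '*' means stop. */
    static char acidFor(String codon) {
        Character acid = ACID_FOR_CODON.get(codon);
        if (acid == null) {
            throw new IllegalArgumentException("unknown codon: " + codon);
        }
        return acid;
    }

    /** Translates codons from startIndex (0-based) up to, not including, the first stop codon. */
    static String translate(String bases, int startIndex) {
        StringBuilder acids = new StringBuilder();
        for (int i = startIndex; i + 3 <= bases.length(); i += 3) {
            char acid = acidFor(bases.substring(i, i + 3));
            if (acid == '*') {
                break;
            }
            // Assumption: the first codon is a start codon and is always reported as 'M'.
            acids.append(i == startIndex ? 'M' : acid);
        }
        return acids.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("attccgtaa", 0)); // prints "MP"
    }
}
```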
Test your program by extracting the genes for proteins from the mitochondrial genome starting at bases 3307 and 4470. (Remember that Java strings are indexed starting at 0 so that you will need to start decoding with baseAt(3306) and baseAt(4469).)
The gene at 3307 is "MTND1", for NADH dehydrogenase subunit 1. You should obtain the amino acid sequence

MPMANLLLLIVPILIAMAFLMLTERKILGYMQLRKGPNVVGPYG LLQPFADAMKLFTKEPLKPATSTITLYITAPTLALTIALLLWTPLPMPNPLVNLNLGL LFILATSSLAVYSILWSGWASNSNYALIGALRAVAQTISYEVTLAIILLSTLLMSGSF NLSTLITTQEHLWLLLPSWPLAMMWFISTLAETNRTPFDLAEGESELVSGFNIEYAAG PFALFFMAEYTNIIMMNTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFLWIRTA YPRFRYDQLMHLLWKNFLPLTLALLMWYVSMPITISSIPPQT

The gene at 4470 is "MTND2", for NADH dehydrogenase subunit 2. You should obtain the amino acid sequence

MNPLAQPVIYSTIFAGTLITALSSHWFFTWVGLEMNMLAFIPVL TKKMNPRSTEAAIKYFLTQATASMILLMAILFNNMLSGQWTMTNTTNQYSSLMIMMAM AMKLGMAPFHFWVPEVTQGTPLTSGLLLLTWQKLAPISIMYQISPSLNVSLLLTLSIL SIMAGSWGGLNQTQLRKILAYSSITHMGWMMAVLPYNPNMTILNLTIYIILTTTAFLL LNLNSSTTTLLLSRTWNKLTWLTPLIPSTLLSLGGLPPLTGFLPKWAIIEEFTKNNSL IIPTIMATITLLNLYFYLRLIYSTSITLLPMSNNVKMKWQFEHTKPTPFLPTLIALTT LLLPISPFMLMIL

Note that the start codon, 'att' at position 4470 in the whole sequence, would encode Isoleucine (I) instead of Methionine (M) if it were not in the first position of the gene (as explained above).
Now test your program on the gene MTRNR2 starting at location 1671.
Write a program to transcribe complementary genes, for example the gene "MT-TE" at location 14674.
Examine the genes listed in the source web page and count how many times each of the 64 different codons occurs in each of them. Display the results for each gene and as a grand total. You should give your program an array containing the starting positions of all the genes (as found on the web page) and it should do the rest by itself.
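The counting step might be sketched as below. The names `CodonCounter`, `countGene` and `countAll` are hypothetical. The sketch also makes several assumptions the question leaves open: it works on a plain base string, takes 0-based start indices rather than the 1-based positions on the web page, includes the terminating stop codon in the counts, and omits codons that never occur instead of listing them with a count of zero.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for illustration only.
final class CodonCounter {
    // Stop codons from the last row of the codon table.
    private static final Set<String> STOPS =
            new HashSet<>(Arrays.asList("aga", "agg", "taa", "tag"));

    /** Counts codons of one gene, from startIndex through its first stop codon. */
    static Map<String, Integer> countGene(String bases, int startIndex) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = startIndex; i + 3 <= bases.length(); i += 3) {
            String codon = bases.substring(i, i + 3);
            counts.merge(codon, 1, Integer::sum);
            if (STOPS.contains(codon)) {
                break; // assumption: the stop codon itself is counted
            }
        }
        return counts;
    }

    /** Adds up the per-gene counts for every start index into a grand total. */
    static Map<String, Integer> countAll(String bases, int[] startIndices) {
        Map<String, Integer> total = new TreeMap<>();
        for (int start : startIndices) {
            countGene(bases, start)
                    .forEach((codon, n) -> total.merge(codon, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        String bases = "atg" + "ttt" + "ttt" + "taa" + "cc" + "atg" + "tgg" + "tag";
        System.out.println(countAll(bases, new int[] {0, 14}));
    }
}
```

A `TreeMap` keeps the codons in alphabetical order, which makes the per-gene and grand-total listings easy to compare.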
Write a program that finds, for each amino acid, which codon most often encodes it. As with Bonus Question 5, you should give your program an array containing the starting positions of all the genes (as found on the web page) and it should do the rest by itself.
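Given codon counts, picking the most frequent codon for one amino acid is a small maximum search. In this sketch, `PreferredCodon` and `mostFrequent` are hypothetical names, the counts are assumed to arrive as a map from codon to count, and ties are resolved in favour of the codon listed first, which is an arbitrary choice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
final class PreferredCodon {
    /**
     * Returns the candidate codon with the highest count.
     * Codons missing from the map count as zero; ties go to the earliest candidate.
     */
    static String mostFrequent(Map<String, Integer> counts, String... candidates) {
        String best = candidates[0];
        for (String codon : candidates) {
            if (counts.getOrDefault(codon, 0) > counts.getOrDefault(best, 0)) {
                best = codon;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("tga", 5); // made-up counts, for demonstration only
        counts.put("tgg", 2);
        // The candidates are the codons in the Tryptophan row of the table.
        System.out.println("W: " + mostFrequent(counts, "tga", "tgg"));
    }
}
```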