CS498CXZ Assignment #1: Biological Database Exploration
(due Sept. 13, 2005, Tuesday, 12:30pm)

  1. [30 points] GenBank is a database of nucleotide sequences. It can be accessed through the National Center for Biotechnology Information (NCBI) website at http://www.ncbi.nlm.nih.gov/. In the search pull-down menu at the top, make sure "nucleotide" is selected. In the search text box at the top of the screen, type "GFP" and click the "Go" button. (GFP, short for "green fluorescent protein", is a well-studied protein.)

    This search will bring up over 1000 results. To narrow the search, click on "Limits" just below the box where you typed "GFP". Limit the search to "gene name" (in the dropdown box) and click the "Go" button again. You will now have approximately 50 results. Go to the end of the list; you will have to click "next" once (the "next" link appears on the right). The last two entries, M62653 and M62654, are from a seminal 1992 paper. Click on M62653, look over the GenBank record, and answer the following questions:

    1. How long is the nucleotide sequence?
    2. How long is its coding region?
    3. Which division is the nucleotide sequence in?
    4. What is the accession ID of this sequence?
    5. What is the ID of the protein it codes?
    6. What are the first nine nucleotides in the coding region?
    7. What are the last three nucleotides in the coding region?
    8. What is the stop codon in this nucleotide sequence?
    9. What are the first five amino acids in the implied protein sequence?
    10. The subsequence "atgtccaga" is outside the coding region in this sequence. If this subsequence were within the coding region, what amino acid sequence would it encode?
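For questions 9 and 10, reading a nucleotide sequence three bases at a time and looking each codon up in the standard genetic code can be sketched as follows. This is only an illustrative sketch: the function name `translate` and the toy input are made up here, and the table assumes the standard genetic code (the record's CDS feature states which translation table actually applies).

```python
from itertools import product

# Standard genetic code in the usual compact form: codons are enumerated
# with bases in the order T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(nucleotides):
    """Translate complete codons from position 0; trailing bases are ignored."""
    seq = nucleotides.upper()
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3)
    )

# Toy input (made up, not taken from M62653).
toy_protein = translate("atggcctaa")
print(toy_protein)
```

The lookup is case-insensitive because GenBank records print sequences in lowercase.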


  2. [50 points] Now select "FASTA" under the "Display" menu; you should now see the sequence in FASTA format. Choose "File" from the "Send to" menu and save the sequence to a file so that you can analyze it further with your own program.
    1. Write a program to obtain the counts of all the four nucleotides (i.e., A, T, G, and C).
    2. Assume that the sequence represents a sample of values of a random variable X that follows a multinomial distribution. That is, X is one of A, T, G, or C, with potentially different probabilities. Use the maximum likelihood estimator (refer to the lecture slides for how to compute the maximum likelihood estimate for a multinomial distribution) to compute the estimated multinomial distribution parameters (i.e., p(X=A), p(X=T), p(X=G), and p(X=C)).
    3. Using log base 2, compute the entropy of X.
    4. Compute the KL-divergence of the estimated multinomial distribution and the uniform multinomial distribution (i.e., D(X || X'), where p(X'=A)=p(X'=T)=p(X'=G)=p(X'=C)= 1/4).
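The four steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference solution: all function names are hypothetical, the input is a made-up toy record rather than M62653, the maximum likelihood estimate is taken to be the relative frequency count(x)/N, and characters other than A, T, G, and C are simply ignored.

```python
import math
from collections import Counter

def read_fasta_sequence(lines):
    """Join the sequence lines of a single-record FASTA file, skipping '>' headers."""
    return "".join(
        line.strip() for line in lines
        if line.strip() and not line.startswith(">")
    ).upper()

def nucleotide_counts(seq):
    """Count A, T, G, and C; any other character is ignored (an assumption)."""
    counts = Counter(seq)
    return {base: counts.get(base, 0) for base in "ATGC"}

def mle_probabilities(counts):
    """Relative frequencies: p(X=x) = count(x) / N."""
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

def entropy_bits(p):
    """H(X) = -sum_x p(x) log2 p(x), with 0 * log 0 treated as 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def kl_divergence_bits(p, q):
    """D(p || q) = sum_x p(x) log2(p(x) / q(x))."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

# Toy FASTA record (made up). For a saved file, the lines could instead
# come from open(path), where path is wherever you saved the sequence.
toy_lines = [">toy_record", "AAAATT", "GC"]
seq = read_fasta_sequence(toy_lines)
counts = nucleotide_counts(seq)
probs = mle_probabilities(counts)
uniform = {base: 0.25 for base in "ATGC"}

print(counts)
print(probs)
print(entropy_bits(probs))
print(kl_divergence_bits(probs, uniform))
```

A useful sanity check when the reference distribution is uniform over four symbols: the KL-divergence should equal 2 minus the entropy in bits.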


  3. [20 points] Now go back to the Entrez search interface and repeat the search for GFP. This time, click on the result labeled "M62654" (instead of "M62653" as you did earlier). Under "Display", make sure you select "GenBank" instead of "FASTA" so that you can see more fields of the record. Now do the following:
    1. Click on "Aequorea victoria" under "ORGANISM", which will bring you to the taxonomy page. What is the Taxonomy ID on this page?
    2. Under "CDS", click on the protein ID; this brings up a page about the encoded protein.
    3. Now click on "Conserved Domains" in the top right corner of this page, which will bring you to a protein conserved domain summary page.
    4. Click on "Show details" to see more information about this domain.
    5. What is the score of the best matching domain (i.e., pfam01353)?
    6. Examine the alignment between the query sequence (i.e., part of our protein sequence) and the "Sbjct" sequence. What is the longest common substring shared by them? Note that the red font indicates shared amino acids.
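If you want to check your answer to the last question programmatically, one possible reading is "the longest run of consecutive alignment columns in which both rows show the same amino acid". The sketch below assumes that reading, and assumes the two rows have been copied out as equal-length strings with "-" for gaps. The function name and the toy rows are made up for illustration.

```python
def longest_identical_run(query_row, sbjct_row):
    """Return the longest stretch of consecutive columns where both rows agree.

    Gap columns ('-') and mismatches end the current run. If several runs
    tie for the longest, the first one is returned.
    """
    best = ""
    current = []
    for q, s in zip(query_row, sbjct_row):
        if q == s and q != "-":
            current.append(q)
            if len(current) > len(best):
                best = "".join(current)
        else:
            current = []
    return best

# Toy alignment rows (made up, not the pfam01353 alignment).
toy_query = "ACDEFGHIKL"
toy_sbjct = "ACNEFGH-KL"
toy_run = longest_identical_run(toy_query, toy_sbjct)
print(toy_run)
```

Note that this compares the rows column by column, so it differs from a general longest-common-substring search, which would also find matches at unaligned positions.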

What to turn in

Please turn in a hard copy of your written answers in class.