CS498CXZ Assignment #2: Genome Sequencing
(due Sept. 20, 2005, Tuesday, 12:30pm)

  1. [10 points] Given three strings s1=ATG, s2=ATTG, s3=TGCC, what is the length of a shortest superstring of s1, s2, and s3? How many different shortest superstrings of s1, s2, and s3 are there in total?
  2. [15 points] Suppose the length of a sequence S is n*k-n+1 and the length of each read is k. Prove that there does not exist any set of n-1 reads that have S as the unique shortest superstring.
  3. [10 points] Let S=CGAGATACGTCATAAAC be a mini genome and our reads are precisely 6 base pairs long. Give a set of 4 particular reads that have S as the unique shortest superstring. What if S=CAAACAAATTAACAAAC? Can you still find a set of 4 particular reads that have S as the unique shortest superstring? Why? ( Note that an earlier version of this problem incorrectly said that a read is 5 base pairs long.)
  4. [20 points] Suppose we obtain 10,000 reads from a genome of 1,000,000 base pairs. Each read is 500 base pairs long. Use the Lander-Waterman model to answer the following questions.
    1. What is the coverage?
    2. What is the probability that a base pair in the genome is not in any of our reads (i.e., not sequenced)?
    3. What is the expected number of gaps in the genome?
    4. What is the probability that a base pair in the genome would occur in at least two reads?
    ( Note that an earlier version of this problem incorrectly said that the number of reads is 100,000.)
  5. [45 points] Use the Eulerian path approach to solve the Sequencing by Hybridization (SBH) problem for the following spectrum:
    S={ATG, GGG, GGT, GTA, GTG, TAT, TGG}
    Label edges and vertices of the graph, and give all the possible minimum-length sequences s such that Spectrum(s,3)=S.

What to turn in

Please turn in a hardcopy of your written answers at the class.