CS498CXZ Assignment #2:
Genome Sequencing
(due Sept. 20, 2005, Tuesday, 12:30pm)
- [10 points] Given three strings s1=ATG, s2=ATTG, s3=TGCC, what is the length of
a shortest superstring of s1, s2, and s3? How many different shortest superstrings of s1, s2, and s3
are there in total?
- [15 points] Suppose the length of a sequence S is n*k-n+1 and the length of each read is k.
Prove that there does not exist any set of n-1 reads that have S as the unique shortest superstring.
- [10 points] Let S=CGAGATACGTCATAAAC be a mini genome and our reads are precisely 6 base pairs long. Give a set of 4 particular reads that have S as the unique
shortest superstring. What if S=CAAACAAATTAACAAAC? Can you still find a set of
4 particular reads that have S as the unique shortest superstring? Why?
( Note that an earlier version of this problem incorrectly said that a read is 5 base pairs long.)
- [20 points] Suppose we obtain 10,000 reads from a genome of 1,000,000 base pairs. Each read is 500 base pairs long. Use the Lander-Waterman model to answer the following questions.
- What is the coverage?
- What is the probability that a base pair in the genome is not in any of our reads (i.e., not sequenced)?
- What is the expected number of gaps in the genome?
- What is the probability that a base pair in the genome would occur in at least two reads?
( Note that an earlier version of this problem incorrectly said that the
number of reads is 100,000.)
- [45 points] Use the Eulerian path approach to solve the Sequencing by Hybridization
(SBH) problem for the following spectrum:
S={ATG, GGG, GGT, GTA, GTG, TAT, TGG}
Label edges and vertices of the graph, and give all the possible minimum-length sequences s such that
Spectrum(s,3)=S.
What to turn in
Please turn in a hardcopy of your written answers at the class.