CS498CXZ Assignment #3: Hidden Markov Models
(due Tuesday, Nov. 29, 2005, 12:30pm)
Introduction
Hidden Markov Models (HMMs) are powerful statistical models for modeling
sequential or time-series data, and have been successfully used to solve
many bioinformatics problems such as sequence alignment, protein classification, and gene finding.
This assignment is designed to ensure you have a good understanding of the basics of HMMs.
You will need to complete an existing implementation of
HMMs in C++, and experiment with some toy sequences.
We will provide a great deal of ``basic'' code, so that you will only need to fill in a few
lines for each algorithm; the required knowledge of C++ is thus minimal. To implement the algorithms correctly, however, you need to understand precisely some details of the underlying probabilistic model, including the Viterbi algorithm for finding the most likely state transition path and how to
estimate HMM parameters using tagged/labeled sequence data.
More specifically, you will be asked to ``fill in'' the required code
for two different tasks (algorithms):
- Compute the most likely path for an observed sequence of symbols from a given HMM using the Viterbi algorithm.
- Estimate an HMM in a supervised way based on a sequence of symbols along
with the observed state transitions.
You will also be asked to test the HMM program on some toy sequences.
The HMM Code Provided
We have provided a significant amount of code in C++ for this assignment.
This includes the source files hmm.hpp, hmm.cpp, and main.cpp, plus a makefile.
The code and the necessary sequence data are available here.
The code is almost complete except that several key functions in hmm.cpp are
not completely implemented. Your task is to complete them. A detailed description of
the HMM algorithms can be found in this note.
The program can be run in three different modes, depending on the option given.
- With option "-d", it decodes a given sequence, i.e.,
computes the most likely path of state transitions given the sequence. The required input includes a model file and an untagged sequence file. (See below for how to pass in such information.)
- With option "-c", it counts the relevant events in a given tagged sequence, and saves the model in the model file given. That is, it estimates the HMM parameters in a supervised way. The required input includes the number of states, a model file name and a tagged sequence file.
- With option "-t", it trains an HMM with a raw sequence without tags, and saves the model in the model file given. That is, it estimates the HMM parameters in an unsupervised way. The required input includes the number of states, a model file name and a raw sequence file.
In all cases, the program recognizes the following options:
- "-m filename" specifies a model file (unspecified probabilities are taken as zero)
- "-s filename" specifies a sequence file (tagged or untagged)
- "-n integer" specifies the desired number of states
You need to complete the implementation of the following functions in the file hmm.cpp.
- void Decode(char *seqFile): This function reads in a sequence and computes the most likely state path using the Viterbi algorithm. It should
print out a tagged sequence similar to testtag in the distributed files.
- void CountSequence(char *seqFile): This function reads in a tagged sequence file and estimates all the HMM parameters based on the counting
of the corresponding events in the tagged sequence.
The following is a brief explanation of how an HMM is represented in the provided code.
An HMM is represented by the class Model. We assume a fixed vocabulary that consists of the 94 ``readable/printable''
ASCII characters that are typically found on a keyboard. More precisely, the symbol set is fixed as the 94
ASCII characters from ``!'' (33) to ``~'' (126).
When a symbol is not observed in the data, its count is automatically
set to zero, effectively excluding it from the model.
The number of states is stored in the variable N.
The parameters of an N-state HMM are represented by the following arrays:
- Initial state probability (I): I is an array of length N, with I[s]
representing the initial probability of being at state s. I is 0-indexed: the first state has an index of zero, and its initial probability is stored in I[0], not I[1].
- State transition probability matrix (A): A is a 0-indexed two-dimensional array (N X N), with A[i][j]
representing the probability of going from state i to state j.
- Output probability matrix (B): B is a 0-indexed two-dimensional array (M X N), with B[o][s]
representing the probability of generating output symbol o
at state s. M is the number of unique symbols (currently 94,
stored in SYMNUM). Note that the observed character cannot be used directly as an index to access the entries of matrix B; it must
first be normalized by subtracting the ``lowest'' character baseChar
defined in the class Model. For example, to find the probability
of generating 'C' from state 1, use B['C'-baseChar][1].
However, the sequence array used in ComputeAlpha, ComputeBeta, and
AccumulateCounts already contains normalized indices, so you can use
an expression like B[seq[t]][i] directly without normalizing seq[t] yourself.
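The indexing convention above can be illustrated with a small sketch. This is an illustration, not the provided code; it only assumes what the text states, namely that baseChar is '!' (ASCII 33) and the vocabulary covers the 94 printable characters up to '~' (ASCII 126).

```cpp
// Sketch of the symbol-indexing convention (illustration only).
const char baseChar = '!';   // lowest character in the vocabulary
const int SYMNUM = 94;       // number of unique symbols

// Map an observed character to a row index of the output matrix B.
int symbolIndex(char c) {
    return c - baseChar;
}

// Check whether a character falls inside the fixed vocabulary.
bool inVocabulary(char c) {
    return c >= baseChar && c < baseChar + SYMNUM;
}
```

For example, symbolIndex('!') is 0 and symbolIndex('~') is 93, matching the 94 rows of B.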
More specific instructions on how to implement these functions can be found in the embedded
documentation in the provided code. In all cases, run ``gmake'' to
recompile the code after you change it.
Tasks
Part I. [45 points] Finding the Most Likely State Path (Viterbi Algorithm)
- 1.1 (25 points) Complete the implementation of the Decode function. The function is
completely implemented except for two assignment statements.
Complete these two assignment statements to make it work. Refer to the embedded
comments in the source code for where these two statements are.
Test your code using the two example model files and sequence files provided.
That is, run the following commands:
% hmm -d -m samplemod1 -s sampleseq1 > sampletag1
% hmm -d -m samplemod2 -s sampleseq2 > sampletag2
Your program should tag every ``a'' with ``0'' and every ``b'' with ``1'' for sampleseq1, and tag the first eight symbols with ``0'' and the last eight symbols with ``1'' in the case of sampleseq2.
- 1.2 (10 points) Take a look at the model specified in samplemod1 and samplemod2.
Explain briefly why the program should tag the two sequences as it did.
In particular explain briefly why it tagged the first b with ``0'' and
the last a with ``1'' in sampleseq2 even though the output probability distribution given state ``0'' strongly favors a and the output probability distribution given state ``1'' strongly favors b.
- 1.3 (10 points) Modify samplemod2 so that the program would tag
all symbols with ``0'' except for the last two b's, which would be tagged as ``1''. Name the modified model file
as samplemod3.
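If the Viterbi recurrence behind the two missing statements is unclear, the following standalone sketch may help. It is not the provided code: the names viterbi, delta, and psi are hypothetical, and it works with raw probabilities rather than log probabilities (which would be needed to avoid underflow on long sequences).

```cpp
#include <vector>
#include <algorithm>

// Viterbi sketch (hypothetical names, not the provided code).
// I[s]: initial probs, A[i][j]: transition probs, B[o][s]: output probs,
// seq: observation indices, already normalized as in the assignment.
std::vector<int> viterbi(const std::vector<double>& I,
                         const std::vector<std::vector<double>>& A,
                         const std::vector<std::vector<double>>& B,
                         const std::vector<int>& seq) {
    int N = I.size(), T = seq.size();
    std::vector<std::vector<double>> delta(T, std::vector<double>(N, 0.0));
    std::vector<std::vector<int>> psi(T, std::vector<int>(N, 0));
    for (int s = 0; s < N; ++s)
        delta[0][s] = I[s] * B[seq[0]][s];        // initialization
    for (int t = 1; t < T; ++t)
        for (int j = 0; j < N; ++j) {
            double best = -1.0; int arg = 0;
            for (int i = 0; i < N; ++i) {
                double p = delta[t-1][i] * A[i][j];  // best predecessor
                if (p > best) { best = p; arg = i; }
            }
            delta[t][j] = best * B[seq[t]][j];
            psi[t][j] = arg;
        }
    // Backtrack the most likely state path.
    std::vector<int> path(T);
    path[T-1] = std::max_element(delta[T-1].begin(), delta[T-1].end())
                - delta[T-1].begin();
    for (int t = T - 1; t > 0; --t) path[t-1] = psi[t][path[t]];
    return path;
}
```

On a two-state model where state 0 strongly favors one symbol and state 1 the other, with strong self-transitions, this recovers the intuitive tagging (e.g., the sequence a a b b is tagged 0 0 1 1).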
Part II. [30 points] Supervised Training of HMM
- 2.1 (20 points) Complete the implementation of the function CountSequence. This
function is mostly implemented except for the part of counting the relevant
events for estimating the HMM parameters. You need to complete six assignment statements, two
for each of the following three events.
- How many times it starts with state s_i
- How many times a transition (from state s_i to s_j) has happened
- How many times character c is generated from state s_i
For each of the three counters, we need to update the counter as well as the corresponding
normalizer. Take a look at the function UpdateParameter(), if you are not sure how these counters and normalizers are used in the provided code.
Test your code using the tagged sequence taggedsampleseq1 by running the following command (taggedsampleseq1 should be identical to the sampletag2 that you got):
% hmm -c -n 2 -m samplesupmod1 -s taggedsampleseq1
Take a look at samplesupmod1. Due to smoothing, the model has a non-zero probability for every symbol, but the majority of probability mass should be on "a" and "b". If your implementation is correct,
you should get a model that would tend to generate ``a'' when at state ``0'' and
tend to generate ``b'' when at state ``1''. Test the model samplesupmod1 using
% hmm -d -m samplesupmod1 -s sampleseq2
You should get exactly the same result as sampletag2 or taggedsampleseq1. If not, you have a problem with the implementation.
- 2.2 (10 points) Change the tag value for one symbol in taggedsampleseq1 (name it taggedsampleseq2) so that
when we apply the trained HMM using taggedsampleseq2 to decode
sampleseq2 the result would be different from taggedsampleseq2. (i.e., the trained HMM can't decode the training data perfectly.)
Part III. [25 points] Unsupervised Training of HMM (Baum-Welch Algorithm)
- 3.1 (10 points) Do the following to train an HMM with an untagged sequence sampleseq2
% hmm -t -n 2 -m sampleunsupmod1 -s sampleseq2
Take a look at sampleunsupmod1. Does the learned model capture some
of the patterns in sampleseq2?
Compare the learned sampleunsupmod1 with samplemod2.
Are they similar? Remember that since we are doing unsupervised training,
state ``0'' and ``1'' may switch their roles. Now, use this
sampleunsupmod1 to decode sampleseq2, i.e., do,
% hmm -d -m sampleunsupmod1 -s sampleseq2
What do you get? Did you get the same tagging as sampletag2, which was produced using samplemod2?
- 3.2 (15 points) sampleseq3 is a mixture of DNA symbols (i.e., A, T, G, C) and amino acid symbols (the 20 letters excluding B, J, O, U, X, Z). First, examine the file and determine where you think the boundaries should be. Then train an HMM on sampleseq3 by doing
% hmm -t -n 2 -m sampleunsupmod3 -s sampleseq3
and apply the HMM to decode sampleseq3 by doing
% hmm -d -m sampleunsupmod3 -s sampleseq3
Can the HMM identify the boundaries between the DNA symbols and amino acid symbols?
Now, change sampleseq3 by inserting one "P" after the second "A"
(counting from the beginning of the sequence), and try the procedure above again.
Is the inserted "P" tagged as an amino acid or a nucleotide? Why?
What if you insert six "P"'s after the second "A"? How is the result different from
when you insert just one "P"? Why?
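Unsupervised (Baum-Welch) training is driven by the forward and backward passes (ComputeAlpha and ComputeBeta in the provided code). As a reference point, here is a sketch of the forward (alpha) recursion only; the names are hypothetical and the layout need not match the provided implementation, which may also scale or use logarithms to avoid underflow.

```cpp
#include <vector>

// Forward-pass sketch (illustration only): alpha[t][s] is the probability
// of emitting seq[0..t] and being in state s at time t. Baum-Welch combines
// this with a symmetric backward pass to re-estimate the parameters.
std::vector<std::vector<double>> computeAlpha(
        const std::vector<double>& I,
        const std::vector<std::vector<double>>& A,
        const std::vector<std::vector<double>>& B,
        const std::vector<int>& seq) {
    int N = I.size(), T = seq.size();
    std::vector<std::vector<double>> alpha(T, std::vector<double>(N, 0.0));
    for (int s = 0; s < N; ++s)
        alpha[0][s] = I[s] * B[seq[0]][s];        // initialization
    for (int t = 1; t < T; ++t)
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int i = 0; i < N; ++i)
                sum += alpha[t-1][i] * A[i][j];   // sum over predecessors
            alpha[t][j] = sum * B[seq[t]][j];
        }
    return alpha;
}
```

The only structural difference from the Viterbi recurrence is the sum over predecessor states in place of a max, which is why the two passes share so much code.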
What to turn in
A. Please turn in a hardcopy of your written answers in class.
B. Please email the following files to our grader, Xuehua Shen (xshen AT cs.uiuc.edu)
- Your complete code (i.e.,hmm.cpp). You must make sure it compiles with the provided hmm.hpp and main.cpp; your code may be tested with some new examples.
- The following files: sampletag1, sampletag2, samplemod3, samplesupmod1, taggedsampleseq2, sampleunsupmod1.