CS498CXZ Assignment #3: Hidden Markov Models
(due Tuesday, Nov. 29, 2005, 12:30pm)

Introduction

Hidden Markov Models (HMMs) are powerful statistical models for sequential or time-series data, and have been successfully used to solve many bioinformatics problems such as sequence alignment, protein classification, and gene finding. This assignment is designed to ensure that you have a good understanding of the basics of HMMs. You will complete an existing implementation of HMMs in C++ and experiment with some toy sequences. We provide a great deal of ``basic'' code, so you will only need to fill in a few lines for each algorithm; the required knowledge of C++ is therefore minimal. To implement the algorithms correctly, however, you need to understand precisely some details of the underlying probabilistic model, including the Viterbi algorithm for finding the most likely state path and how to estimate HMM parameters from tagged/labeled sequence data.
More specifically, you will be asked to ``fill in'' the required code for the following tasks (algorithms):

You will also be asked to test the HMM program on some toy sequences.

The HMM Code Provided

We have provided a significant amount of C++ code for this assignment: three source files (hmm.hpp, hmm.cpp, main.cpp) and a makefile. The code and the necessary sequence data are available here. The code is almost complete, except that several key functions in hmm.cpp are not fully implemented. Your task is to complete them. A detailed description of the HMM algorithms can be found in this note.

The program can be run in three different modes, depending on the option given.

In all cases, the program recognizes the following options:

You need to complete the implementation of the following functions in the file hmm.cpp.

The following is a brief explanation of how an HMM is represented in the provided code.

An HMM is represented by the class Model. We assume a fixed vocabulary that consists of the 94 ``readable/printable'' ASCII characters that are typically on a keyboard. More precisely, the symbol set is fixed as the 94 ASCII characters from `!' (code 33) to `~' (code 126). When a symbol is not observed in the data, its count is automatically set to zero, effectively excluding it from the model.

The number of states is stored in the variable N. The parameters of an N-state HMM are represented by the following arrays:

More specific instructions on how to implement them can be found in the embedded documentation in the provided code. In all cases, run ``gmake'' to recompile after you change the code.

Tasks

Part I. [45 points] Finding the Most Likely State Path (Viterbi Algorithm)

Part II. [30 points] Supervised Training of HMM

Part III. [25 points] Unsupervised Training of HMM (Baum-Welch Algorithm)

What to turn in

A. Please turn in a hardcopy of your written answers in class.

B. Please email the following files to our grader, Xuehua Shen (xshen AT cs.uiuc.edu)

  1. Your completed code (i.e., hmm.cpp). You must make sure it compiles with the provided hmm.hpp and main.cpp; your code may be tested on some new examples.
  2. The following files: sampletag1, sampletag2, samplemod3, samplesupmod1, taggedsampleseq2, sampleunsupmod1.