CS498CXZ Assignment #5: Gene Function Analysis
(due Nov. 10, 2005, Thursday, 12:30pm)

  1. [40 points] K-means Clustering: Implement the Lloyd algorithm for K-means clustering as described on page 347 of the textbook. The input for your program includes: (1) The number of experiment conditions m and the number of genes n; (2) A file containing an n X m data table, i.e., a set of n m-dimensional vectors each corresponding to the expression values of a gene. The data table should be of the following format:
     
    X11 X12 .... X1m
    X21 X22 .... X2m
    ...
    Xn1 Xn2 .... Xnm
    
    where each row represents a vector of expression values for a gene; and (3) The number of clusters k. Note that you may use the data file to implicitly specify m and n if it turns out to be easier. The output of your program should be (1) The squared error distortion value d; (1) The k centroid vectors in the same format as the data file; and (3) The objects in each cluster in the following format:
    cluster1: 1 3 5
    cluster2: 2 6 7 10
    ...
    clusterk: 4 8 12 9
    
    where the numbers are the indices of objects, so for example, the first cluster (i.e., "cluster1") has three objects, corresponding to the 1st, 3rd, and the 5th rows in the data table. In each iteration, you should compute the squared error distortion as described on page 346 of the textbook and test whether the squared error distortion decreases. Use a threshold on the change of the squared error distortion to test whether you are done with the clustering.

  2. [60 points ] Motif Finding: Implement the motif finding algorithm SIMPLE_CONSENSUS. Please click here (MS Word file) to see the detailed description of SIMPLE_CONSENSUS and the required input and output for your program. Print out the output of your program as the written answers. The three data files required for problem C are here: biocoid.fa, hunchback.fa, and kruppel.fa.

    A simple test case is available here.

What to turn in

Please turn in a hardcopy of your written answers at the class. Please email your code and any other related information about your code to the grader Xuehua Shen (xshen AT cs.uiuc.edu).