CS397CXZ Assignment #1: Basic probability/statistis and statistical NLP
(due Sept 19, 2003, 2:00pm)

  1. Probabilistic Reasoning and Bayes Rule [35 points]

    Consider the problem of detecting email messages that may carry a virus. This problem can be modeled probabilistically by treating each email message as representing an observation of values of the following 4 random variables: (1) V: whether the message carries a virus (1 for yes); (2) A: whether the message has an attachment (1 for yes); (3) L: whether the message is longer than 20 words (1 for yes); and (4) K: whether the sender is known to the receiver (1 for yes). Given a message, we can observe the values of A, L, and K, and we want to infer its value of V. In terms of probabilistic reasoning, we are interested in evaluating the conditional probability p(V|A, L, K), and we would say that the message carries a virus if p(V=1|A, L,K)>p(V=0|A,L,K).

    We make a further assumption that p(A,L,K|V)=p(A|V)p(L|V)p(K|V) for V=0 and V=1, i.e., given the status whether a message carries a virus, the values of A, K, and L are independent.

    A. [10 points] Use the Bayes formula and the following known conditional probabilities to compute the probability that a message M with A=1, L=0, and K=0 carries a virus. I.e., compute p(V=1|A=1,L=0,K=0) and p(V=0|A=1,L=0,K=0). Would we conclude that message M carries a virus?

    ----------------------------------------------
    | V  |p(A=1|V)|p(L=1|V) |p(K=1|V) |prior P(V) |
    |----|--------|---------|---------|-----------|
    | 0  |  0.1   |  0.9    |   0.9   |    0.9    |
    |----|--------|---------|---------|-----------|
    | 1  |  0.99  |  0.2    |   0.05  |    0.1    |
    ----------------------------------------------|
    
    B. [5 points] The probabilities in the table above are assigned subjectively, which means that different people may choose to assign different values to these probabilities. Is there any constraint that we must respect when specifying these values? Can we arbitrarily change one of the values without worrying about changing at least another value?

    C. [5 points] Explain how you can change our conclusion regarding whether message M carries a virus by changing the value of only one cell in the table.

    D. [5 points] Note that the conditional independency assumption p(A,L,K|V)=p(A|V)p(L|V)p(K|V) helps simplify the computation of p(A,L,K|V). If we were to specify the values for p(A,L,K|V) directly, how many probability values would we have to specify?

    E. [5 points] Does the assumption p(A,L,K|V)=p(A|V)p(L|V)p(K|V) imply that p(A|L)=p(A)? Why?

    F. [5 points] Explain why the assumption p(A,L,K|V)=p(A|V)p(L|V)p(K|V) does not necessarily hold in reality.

  2. Maximum Likelihood Estimation [15 points]

    A poisson distribution is often used to model the word frequency. Specifically, the number of occurrences of a word in a document with fixed length can be assumed to follow a Poisson distribution given by

    		
    		     u^x * exp(-u)
    p(wordcount=x | u) =  -----------------
    			x!
    
    where "u^x" means u raised to the power of x, and u is the parameter of the Poisson distribution, which happens to be its mean. Now, suppose we observe a sample of counts of a word w, {x_1, ..., x_N}, from N documents with the same length (x_i is the counts of w in one document). We want to estimate the parameter u of the Poisson distribution for word w. One commonly used method is the maximum likelihood method, in which we choose a value for u that maximizes the likelihood of our data {x_1,...,x_N}. Derive a closed form formula for this estimate. (Hint: Write down the likelihood of {x_1,...,x_N}, which would be a function of u. Set the derivative of this function w.r.t. u to zero, and solve the equation for u. )

  3. Getting familiar with text [20 points]

    Here are two current news articles downloaded from the Web: article1 and article2. Download these two articles and save them on your local disk, and perform some simple statistical analysis of the text without reading the contents of the articles.

    Some Unix commands can be very useful for manipulations of text. Here is an example of how you can use commands such as "awk", "uniq", and "sort" to produce word counts. Suppose the text you want to count is named "mysource". You can do the following to generate counts of each word in "mysource":

    1. Do "tr 'A-Z' 'a-z' < mysource > mysource.lowercase". This will convert all capital letters to lower cases.
    2. Do "awk '{for (i=1;i<=NF;i++) print $i;}' mysource.lowercase >mysource.onewordperline". This will split the words on a line so that each line has only one word.
    3. Do " sort mysource.onewordperline | uniq -c > mysource.count". This will first sort all the words and then count the number of occurrences of each word.
    4. Do "sort -rn -k1 mysource.count > mysource.countsorted". This will sort the words in descending order of counts so that you can easily see the high frequency words.
    These steps can also be combined together as one single command:

    "tr 'A-Z' 'a-z' < mysource | awk '{for (i=1;i<=NF;i++) print $i;}' | sort | uniq -c |sort -rn -k1 > mysource.countsorted".

    A. [5 points] Without reading the articles, do a simple word count for each article with the commands given above. Visually examine the word distribution for each article. Does the word distribution give you some idea about what each news article might be about?

    B. [5 points] Compare the word distributions of these two articles. How are they similar to each other and how are they different?

    C. [5 points] Do part of speech tagging and partial parsing on each article. A POS tagger called "ltchunk" is already installed at /home/class/cs397cxz/ltchunk/bin/ltchunk, so the following command would do the job (assuming your source file is "mysource"):

    "tr 'A-Z' 'a-z' < mysource | /home/class/cs397cxz/ltchunk/bin/ltchunk -show_tags > mysource.tagged".

    Note that you must use the option "-show_tags" to see the POS tags. This will assign POS tags to each word and partially parse the text to identify noun phrases (in brackets) and verbe phrases (in paratheses). For the explanation of Penn Treebank TAGS look at: http://www.ltg.ed.ac.uk/~mikheev/tagger_demo.html#PennTagset. From there you can also find more information about the LTChunk software.

    Examine the results. Did ltchunk make any mistakes in taggin and grouping words? If so, give a few examples.

    D. [5 points] Perl is often handy for processing text. Here is an example of how to use a simple command-line perl script to extract the noun phrases from the output of ltchunk, assuming "mysource.tagged" is the output from ltchunk:

    "perl -ne '{while (/\[\[\s([^\]]+)\s\]/g) {$np=$1; $np=~ s/\_[A-Z]+//g; print "$np\n";}}' < mysource.tagged > mysource.np".

    Use this command to generate a list of noun phrases for each article, and use the "sort" and "uniq" commands to generate counts of noun phrases for each article. Sort the counts so that the high count phrases would be on the top. (Use something like "sort mysource.np | uniq -c |sort -rn -k1 ".) Are these phrase counts more informative than the single word counts in suggesting what each article might be about? Now, read the two articles. Are their contents similar to what you were expecting?

  4. Unigram language model and smoothing [30 points]

    In this part of the assignment, you will use the same two news articles to explore simple unigram language models. Use the "Al Qaida" article as the "test data", and the other as the "training data" for estimating a unigram language model. Use any programming language that you like. You may find it is relatively easy if you use perl since its associative array is very convenient for storing a language model.

    A. [5 points] Compute the empirical word distribution of your test article p(w). This is just the relative frequency of each word, i.e., the counts of w divided by the total number of words in that article. Compute the entropy of p(w), and report the value. (You may choose to compute the entropy directly based on the word counts without explicitly computing p(w).)

    B. [15 points] Assume that the vocabulary size is 10000. Use the "add one" smoothing method to estimate a smoothed unigram language model based on your training article q(w). Compute the following four quantities. Note that q(w) should be regarded as a distribution over all the vocabulary but with zero probability for any word not in the test article.

    1. The cross entropy H(p,q) as defined on the lecture slides.
    2. The log-likelihood of the test article using the trained unigram language model q(w).
    3. The perplexity of q(w) on the test article.
    4. The Kullback-Leibler divergence D(p||q) as defined on the lecture slides.

    C. [10 points] Use "AI Qaida" as a query to search the Web. Cut and paste about 20 lines of text from relevant articles about AI Qaida, and save it as another training article. Use this article to train another unigram language model in exactly the same way as you did with your original training article. Denote this model by q'(w). Compute the same four quantities as you have done for the original training document using q' to replace q (i.e., cross-entropy, log-likelihood, perplexity, and KL-divergence). Compare the four values with what you obtained with the original training document. Which training document gives us a better model for our test article and why?

What to turn in

A. Please turn in a hardcopy of your written answers at the class.

B. Please copy the following files to your handin directory on machines of CSIL, which is /home/class/cs397cxz/handin/assign1/YOUR_ID. (The directory may not be writable until Sept. 15.)

  1. All your code for problem 4. No documentation is necessary; this is just to show me that you did actually compute the values.
  2. One noun phrase count file from problem 3(D). This is to show me that you did actually try those commands to generate the counts.