Consider the problem of detecting email messages that may carry a virus. This problem can be modeled probabilistically by treating each email message as representing an observation of values of the following 4 random variables: (1) V: whether the message carries a virus (1 for yes); (2) A: whether the message has an attachment (1 for yes); (3) L: whether the message is longer than 10 words (1 for yes); and (4) K: whether the sender is known to the receiver (1 for yes). Given a message, we can observe the values of A, L, and K, and we want to infer its value of V. In terms of probabilistic reasoning, we are interested in evaluating the conditional probability p(V|A, L, K), and we would say that the message carries a virus if p(V=1|A, L,K)>p(V=0|A,L,K).
We make a further assumption that p(A,L,K|V)=p(A|V)p(L|V)p(K|V) for V=0 and V=1, i.e., given the status whether a message carries a virus, the values of A, K, and L are independent.
A. [10 points] Use the Bayes formula and the following known conditional probabilities to compute the probability that a message M with A=1, L=0, and K=0 carries a virus. I.e., compute p(V=1|A=1,L=0,K=0) and p(V=0|A=1,L=0,K=0). Would we conclude that message M carries a virus?
---------------------------------------------- | V |p(A=1|V)|p(L=1|V) |p(K=1|V) |prior P(V) | |----|--------|---------|---------|-----------| | 0 | 0.1 | 0.9 | 0.9 | 0.9 | |----|--------|---------|---------|-----------| | 1 | 0.99 | 0.2 | 0.05 | 0.1 | ----------------------------------------------|B. [5 points] The probabilities in the table above are assigned subjectively, which means that different people may choose to assign different values to these probabilities. Are there any constraints on these values that we must respect when assigning these values? In other words, can we fill in the table with 8 arbitrary values between 0 and 1?
C. [5 points] Explain how you can change our conclusion regarding whether message M carries a virus by changing the value of only one cell in the table.
D. [5 points] Note that the conditional independency assumption p(A,L,K|V)=p(A|V)p(L|V)p(K|V) helps simplify the computation of p(A,L,K|V). In particular, with this assumption, we can compute p(A,L,K|V) based on p(A|V), p(L|V), and p(K|V). If we were to specify the values for p(A,L,K|V) directly, what is the minimum number of probability values that we would have to specify in order to fully characterize the conditional probability distribution p(A,L,K|V)? Why? Note that all the probability values must sum to 1.
E. [5 points] Explain why the assumption
p(A,L,K|V)=p(A|V)p(L|V)p(K|V) does not necessarily hold in reality.
A Poisson distribution is often used to model the word frequency. Specifically, the number of occurrences of a word in a document with fixed length can be assumed to follow a Poisson distribution given by
u^x * exp(-u) p(wordcount=x | u) = -------------- x!where "u^x" means u raised to the power of x, and u is the parameter of the Poisson distribution, which happens to be its mean. Now, suppose we observe a sample of counts of a word w, {x_1, ..., x_N}, from N documents with the same length (x_i is the counts of w in one document). We want to estimate the parameter u of the Poisson distribution for word w. One commonly used method is the maximum likelihood method, in which we choose a value for u that maximizes the likelihood of our data {x_1,...,x_N}. Derive a closed form formula for this estimate. (Hint: Write down the likelihood of {x_1,...,x_N}, which would be a function of u. Set the derivative of this function w.r.t. u to zero, and solve the equation for u. )
Here are two current news articles downloaded from the Web: article1 and article2. Download these two articles and save them on your local disk, and perform some simple statistical analysis of the text without reading the contents of the articles.
Some Unix commands can be very useful for manipulations of text. Here is an example of how you can use commands such as "awk", "uniq", and "sort" to produce word counts. Suppose the text you want to count is named "mysource". You can do the following to generate counts of each word in "mysource":
"tr '[A-Z]' '[a-z]' < mysource | awk '{for (i=1;i<=NF;i++) print $i;}' | sort | uniq -c |sort -rn -k1 > mysource.countsorted".
A. [5 points] Without reading the articles, do a simple word count for each article with the commands given above. Visually examine the word distribution for each article. Does the word distribution give you some idea about what each news article might be about?
B. [5 points] Compare the word distributions of these two articles. How are they similar to each other and how are they different?
C. [5 points] Do part of speech tagging and partial parsing on each article. A POS tagger called "ltchunk" is already installed at /home/class/sp08/cs410/public_html/lt_chunk/bin/ltchunk on CSIL Linux machines, so the following command would do the job (assuming your source file is "mysource"):
"tr '[A-Z]' '[a-z]' < mysource | /home/class/sp08/cs410/public_html/lt_chunk/bin/ltchunk -show_tags > mysource.tagged".
Note that you must use the option "-show_tags" to see the POS tags. This will assign POS tags to each word and partially parse the text to identify noun phrases (in brackets) and verb phrases (in parentheses). For the explanation of Penn Treebank TAGS look at: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
Examine the results. Did ltchunk make any mistakes in tagging and grouping words? If so, give a few examples.
D. [5 points] Perl is often handy for processing text. Here is an example of how to use a simple command-line perl script to extract the noun phrases from the output of ltchunk, assuming "mysource.tagged" is the output from ltchunk:
"perl -ne '{while (/\[\[\s([^\]]+)\s\]/g) {$np=$1; $np=~ s/\_[A-Z]+//g; print "$np\n";}}' < mysource.tagged > mysource.np".
Use this command to generate a list of noun phrases for each article, and
use the "sort" and "uniq" commands to generate counts of noun phrases for each
article. Sort the counts so that the high count phrases would be on the top.
(Use something like "sort mysource.np | uniq -c |sort -rn -k1 ".) Are
these phrase counts more informative than the single word counts in suggesting
what each article might be about? Now, read the two articles. Are their
contents similar to what you were expecting?
In this part of the assignment, you will use the same two news articles to explore simple unigram language models. Use article2 as the "test data", and the other as the "training data" for estimating a unigram language model. Use any programming language that you like. You may find it is relatively easy if you use perl since its associative array is very convenient for storing a language model. In all cases, assume that the size of the underlying vocabulary is 10,000, i.e., there are a total number of 10,000 words that can potentially occur in an article, and the words we see in these two article are only a subset of the whole vocabulary. Thus, any unigram language model should be regarded as a word distribution over all the 10,000 words, though it may sometimes have zero probabilities for many of the 10,000 words. Use the natural logarithm whenever calculation of a logarithm function is involved. (Note that entropy is normally computed with base-2 logarithm, which allows us to interpret an entropy value as the number of bits; here we ask you to use the natural logarithm for convenience since it is directly supported by many programming languages.)
A. [10 points] Compute p(w), the empirical word distribution of your test article (i.e., article2). This is just the relative frequency of each word, i.e., the counts of w divided by the total number of words in that article. Compute the entropy of p(w), and report the value. Note that 0log0=0.
B. [15 points] Use the "add one" smoothing (i.e., Laplace smoothing) method to estimate a smoothed unigram language model, q(w), based on your training article (i.e., article1). Compute the following three quantities.
C. [10 points] Use "flu vaccine" as a query to search the Web. Cut and paste about 50 lines of text from relevant articles about flu vaccine, and save it as another training article. Use this article to train another smoothed unigram language model in exactly the same way as you did with your original training article. Denote this model by q'(w). Compute the same three quantities (i.e., cross-entropy, log-likelihood, and KL-divergence) as you have done for the original training document but now using q' to replace q. Compare the three values with what you obtained with the original training document. Which training document gives us a better model for predicting our test article and why?
A. Please turn in a hardcopy of your written answers in the class.
B. Please pack all the following files into one single zip file or tar file and send the package to our TA, Maryam Karimzadehgan by email (mkarimz2 AT gmail DOT com) with "CS410 - Assign1" as the subject.