Goal: to know about this classic paper and appreciate Bush's great vision, which has NOT yet been completely realized. Minimum reading: Everything starting from section 6. Required level of understanding: Deep understanding is NOT required.
Goal: to know more about the state of the art of POS tagging and shallow parsing techniques.
Minimum reading: None; read whatever you can understand without worrying too much about the details. Required level of understanding: Deep understanding is NOT required
Goal: to know more about the state of the art of POS tagging and shallow parsing techniques.
Minimum reading: None; read whatever you can understand without worrying too much about the details. Required level of understanding: Deep understanding is NOT required
Goal: to know about some basic concepts in probability, statistics, and information theory.
Minimum reading: Section 3 of the estimation note; All of the information theory note except
for section 1.1.6. Required level of understanding: You should fully understand
the derivation of the maximum likelihood estimate for the binomial distribution, and most of the
contents in the information theory notes. If you can't understand these, you may want to consult
a textbook on probability and
statistics, and a book on information theory. Any book on these topics should be sufficient.
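As a quick self-check on the binomial derivation mentioned above, here is a minimal sketch of the standard argument. The notation (n trials, k successes, parameter \theta) is an assumption made for illustration and may differ from the notation used in the estimation note.

```latex
% Assumed notation: k successes observed in n independent trials,
% each succeeding with unknown probability \theta.
\begin{align*}
L(\theta) &= \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k} \\
\log L(\theta) &= \log\binom{n}{k} + k\log\theta + (n-k)\log(1-\theta) \\
\frac{d}{d\theta}\log L(\theta) &= \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0
\;\Longrightarrow\; \hat{\theta}_{\mathrm{ML}} = \frac{k}{n}
\end{align*}
```

If you can reproduce each step without looking, you are at the level this reading asks for.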
Goal: to know about the overall state of the art of statistical language models.
Minimum reading: You should try to read the whole paper, but don't worry about
details that you can't understand. Required level of understanding: It's fine to skip some details.
Goal: to know about different smoothing methods.
Minimum reading: None; this is an optional reading, but the paper has a nice
exposition of different smoothing methods. Required level of understanding: Focus
on the smoothing formulas if interested.
Goal: to know about the basic idea behind EM and how it can be used to estimate
a mixture language model.
Minimum reading: Read up to Section 4; skip the HMM part.
Required level of understanding: You should try to fully understand what you read.
Goal: to know about the thorough derivation of EM
Minimum reading: None; this is an optional reading. It's an excellent,
rigorous explanation of EM. Read up to Section 3; skip the HMM part. Required level of understanding: Read it only if you are seriously interested in knowing all the details about EM.
Note: This is an excellent tutorial on EM.
Goal: to know about EM from the optimization viewpoint.
Minimum reading: None; this is an optional reading. Required level of understanding: Read it only if you are seriously interested in knowing all the details about EM.
Note: This is an excellent explanation of EM from the viewpoint of optimization.
Goal: to know about the general history of IR and a summary of IR techniques from an empirical perspective. Minimum reading: Read the whole paper. Required level of understanding: Focus on the text part; we'll discuss those formulas in class.
Goal: to know how to compute basic retrieval measures.
Minimum reading: Read at least Section 1 and Section 2.
Required level of understanding: You should understand everything in Section 1 and Section 2
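To make "basic retrieval measures" concrete, here is a minimal sketch of three common ones: precision and recall at a cutoff, and (non-interpolated) average precision. This is an illustration under stated assumptions, not necessarily the exact set of measures covered in the reading; the function names and the toy ranking are hypothetical.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the top-k results of a ranked list.

    Assumes `relevant` is a non-empty set of relevant document ids
    and 1 <= k <= len(ranked).
    """
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k, hits / len(relevant)


def average_precision(ranked, relevant):
    """Sum of precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents.
    Relevant documents that are never retrieved contribute zero."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy example: 5 retrieved documents, 3 relevant ones (d6 is never retrieved).
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p, r = precision_recall_at_k(ranked, relevant, k=5)
ap = average_precision(ranked, relevant)
```

In the toy example, two of the five retrieved documents are relevant, so precision is 2/5 and recall is 2/3; average precision is (1/1 + 2/3) / 3.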
Goal: to know some possible alternative TF-IDF weighting formulas
Minimum reading: Read the whole paper
Required level of understanding: You should understand everything in this paper. This paper summarizes the "pre-TREC" effective retrieval formulas. Since TREC started in the early 1990s, other
improved formulas have been developed; pivoted normalization and Okapi are the two main ones.
More recently, language models have become more popular.
Goal: to know how the pivoted length normalization method is developed.
Minimum reading: Read as much as you can.
Required level of understanding: You should understand the motivation for the pivoted length normalization
method.
Goal: to know about the foundation of probabilistic retrieval models, including language models.
Minimum reading: Read the whole paper.
Required level of understanding: You should try to understand everything in this paper.
Goal: to know about the Robertson-Sparck Jones model, the BM25 formula, and the Okapi system.
Minimum reading: Read section 2 (Foundations) and Section 4 (Data), but this is an optional reading.
Required level of understanding: If you do read it, you should try to understand everything in Section 2.
Goal: to know about traditional probabilistic retrieval models.
Minimum reading: None. This is an optional reading.
Required level of understanding: N/A.
Goal: to know about logistic regression for retrieval.
Minimum reading: None. This is an optional reading.
Required level of understanding: N/A.
Goal: to know about how to build an inverted index.
Minimum reading: Focus on the compression methods for integers and the sorting
based method for building an inverted index.
Required level of understanding: Know how gamma-coding works and know how to
build an inverted index using the sorting-based method.
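As a self-check for gamma-coding, here is a minimal sketch of an encoder and decoder. It assumes the convention in which the unary length prefix is written as a run of 1s terminated by a 0, followed by the binary digits of the number without its leading 1; some texts flip the bits of the prefix, so compare against the reading before relying on it. The function names are hypothetical, and codes are kept as strings of '0'/'1' for readability rather than packed into bytes.

```python
def gamma_encode(n):
    """Gamma code of a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    offset = bin(n)[3:]  # binary digits of n with the leading '1' dropped
    # Unary prefix: one '1' per offset bit, then a terminating '0'.
    return "1" * len(offset) + "0" + offset


def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into a list of integers.

    Assumes `bits` is a well-formed sequence of complete codes.
    """
    values = []
    i = 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # read the unary prefix
            length += 1
            i += 1
        i += 1  # skip the terminating '0'
        offset = bits[i:i + length]
        i += length
        # Restore the implicit leading 1 bit in front of the offset.
        values.append((1 << length) | (int(offset, 2) if offset else 0))
    return values


gaps = [1, 5, 9, 13]  # e.g. gaps between document ids in a postings list
encoded = "".join(gamma_encode(g) for g in gaps)
decoded = gamma_decode(encoded)
```

Under this convention the code for 1 is `0` and the code for 5 is `11001`. Because each code is self-delimiting, the concatenated string decodes back to the original gaps without any separators.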
Goal: to know how the query-likelihood retrieval method works and how smoothing is related to TF-IDF weighting.
Minimum reading: Read the whole paper.
Required level of understanding: You should understand everything in the first three sections.
Goal: to know how the query-likelihood retrieval method works and how smoothing is related to TF-IDF weighting.
Minimum reading: Focus on Section 3, but this is an optional reading.
Required level of understanding: N/A.
Goal: to know the KL-divergence scoring formula and how a mixture model can be used to do feedback.
Minimum reading: Read the whole paper.
Required level of understanding: Understand how the mixture model works.
Goal: to know about the basic idea of the risk minimization framework.
Minimum reading: Read sections 1, 2, and 3.
Required level of understanding: Try to understand why the risk minimization framework
is more general than a vector space model or a probabilistic retrieval model.
Goal: to understand the basic idea of using score distributions to set the filtering threshold.
Minimum reading: Read section 1 and section 2.
Required level of understanding: Try to understand how scores of relevant and nonrelevant documents are modeled differently and how to set the filtering threshold based on estimated score distributions.
Goal: to know about some representative algorithms for collaborative filtering. Minimum reading: Read the whole paper. Required level of understanding: Focus on understanding the correlation coefficient method.
Goal: to understand the state of the art of text categorization.
Minimum reading: Read the whole paper
Required level of understanding: Read through the paper without worrying about not following some details. (The Naive Bayes classifier is our focus.)
Goal: to get some sense about which method might work for document clustering.
Minimum reading: Read the whole paper
Required level of understanding: You should be able to understand everything in this paper, but don't worry about any detail that you can't follow.
(The mixture model approach is our focus for clustering methods.)
Goal: to understand the derivation of the Baum-Welch algorithm using EM.
Minimum reading: None.
Required level of understanding: This is a completely optional reading. Read it only if you want to know about EM and the Baum-Welch algorithm in depth.
Goal: to understand the basics of HMMs.
Minimum reading: Read Sections I, II, and III.
Required level of understanding: You should try to understand everything in Sections I and II;
read section III as much as you can.
The following readings [3-7] are ALL OPTIONAL. Read them if you want to know more about the topic.
Goal: to get an overview of web mining
Minimum reading: Read the entire paper.
Required level of understanding: Try to understand the overall picture; ignore the details.
Goal: to get an overview of Web search engines
Minimum reading: Read the entire paper.
Required level of understanding: Try to understand the overall picture; ignore the details.
Goal: to get a sense of what some real Web search engines can do.
Minimum reading: Read the entire page at the URL above.
Required level of understanding: Try to know the major features the existing search engines support.