Review List for CS397-CXZ Midterm
Part I: Understanding text -- Natural Language Processing
In this part of the course, you are expected to
- have a general picture of what we can do and what we can't do with today's NLP techniques.
- know what POS tagging is, what parsing is, and what syntactic/structural ambiguity is.
- have a good understanding of some of the very basic concepts in probability, statistics, and information theory.
In particular, you should know the basic rules for conditional probabilities, especially
Bayes' rule.
- know how to do maximum likelihood estimation and how it differs from Bayesian estimation.
- know how to compute entropy, cross entropy, mutual information, and KL-divergence, and know how
they relate to one another.
- know what a statistical language model is and what a unigram/bigram language model is.
- know how to estimate a unigram language model using the maximum likelihood estimator.
- know why smoothing is necessary when estimating a language model, and know the formulas for Laplace smoothing,
Dirichlet prior smoothing, and linear interpolation smoothing, as well as their similarities and differences.
- know what a mixture unigram language model is and how to estimate
the mixing coefficient using EM. Know what "leave-one-out" cross validation is.
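As a study aid for the estimation and smoothing items above, here is a minimal Python sketch of the standard textbook formulas. It is not taken from the lectures: the function names, the toy document and collection, and the value of mu are illustrative assumptions, and the notation used in class may differ.

```python
import math
from collections import Counter

def mle_unigram(tokens):
    """Maximum likelihood estimate: p(w) = c(w) / N."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def laplace(word, counts, total, vocab_size):
    """Laplace (add-one) smoothing: (c(w) + 1) / (N + |V|)."""
    return (counts.get(word, 0) + 1) / (total + vocab_size)

def dirichlet(word, counts, total, p_ref, mu):
    """Dirichlet prior smoothing: (c(w) + mu * p(w|C)) / (N + mu)."""
    return (counts.get(word, 0) + mu * p_ref[word]) / (total + mu)

def interpolate(word, counts, total, p_ref, lam):
    """Linear interpolation: (1 - lam) * c(w) / N + lam * p(w|C)."""
    return (1 - lam) * counts.get(word, 0) / total + lam * p_ref[word]

def entropy(p):
    """H(p) = -sum_w p(w) log2 p(w)."""
    return -sum(pw * math.log2(pw) for pw in p.values() if pw > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) log2 q(w); q must be nonzero wherever p is."""
    return -sum(pw * math.log2(q[w]) for w, pw in p.items() if pw > 0)

def kl_divergence(p, q):
    """D(p || q) = sum_w p(w) log2 (p(w) / q(w))."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

# Toy data (hypothetical): one short document and a slightly larger collection.
doc = "the cat sat on the mat".split()
collection = doc + "the dog ate the bone".split()
counts, total = Counter(doc), len(doc)
p_ref = mle_unigram(collection)      # collection (reference) language model
vocab = sorted(p_ref)
mu = 10.0                            # assumed prior strength, chosen arbitrarily
p_doc = {w: dirichlet(w, counts, total, p_ref, mu) for w in vocab}
p_mle = mle_unigram(doc)
```

Two relations are worth checking by hand against this sketch. Dirichlet prior smoothing is the same as linear interpolation with a document-dependent coefficient lam = mu / (N + mu), and cross entropy decomposes as H(p, q) = H(p) + D(p || q). The unsmoothed estimate p_mle gives zero probability to unseen words such as "dog", which is why it cannot be used as q in the cross entropy.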
Part II: Accessing Text -- Text Retrieval and Filtering
We have only one lecture to cover filtering, and it will not be examined on the midterm.
In this part of the course, you are expected to
- know how text retrieval is different from database retrieval
- know the distinction between long-term and short-term information needs and how
they can be satisfied in different ways (short-term = ad hoc retrieval; long-term = filtering).
- know why ranking (without an explicit cutoff) is often preferred to selecting
a subset of documents for the user.
- know how to compute the basic retrieval evaluation measures (i.e., precision, recall, and
mean average precision).
- know what stemming is, what a stop word is, and what Zipf's law says.
- know what relevance feedback is, what pseudo feedback is, and how they are different.
- know the basic idea of the vector space model. (What assumptions are we making?)
- know the major term weighting heuristics (i.e., TF, IDF, and document length normalization).
- know the idea and formula for Rocchio feedback.
- know the basic idea of a probabilistic retrieval model. (What assumptions are we making?)
- know how to use logistic regression for retrieval.
- know the difference between the classic probabilistic model and the query likelihood language
modeling approach (i.e., document generation vs. query generation). Where are they similar to each other and
where are they different? Know why the document generation models can perform feedback more naturally.
- know the formula for the Robertson-Sparck-Jones model and how to derive it. Know why its parameters are hard to
estimate when we don't have feedback information.
- know what an inverted index is and how to build a large inverted index
with only a limited amount of memory. Know how to score documents quickly using an inverted index (i.e.,
how to use scoring accumulators for scoring).
- know the basic compression methods for integers (i.e., unary, gamma, delta, and gap).
- know the general retrieval formula of the query-likelihood retrieval method when the document
language model is smoothed with a collection language model, and know why smoothing with a collection language model leads to a retrieval formula that is
similar to a traditional TF-IDF retrieval formula with length normalization.
- know why we need to do two-stage smoothing and what the two different roles of smoothing are.
- know the KL-divergence retrieval formula and why it covers the query likelihood method as a special case.
Know how to use a simple two-component mixture model for feedback.
- know the general idea of EM and how to use EM to estimate the component model in a simple mixture model.
- know how some simple retrieval techniques can be adapted to do other things, such as text segmentation and summarization.
- know that treating retrieval as a decision problem is a more general way of modeling the retrieval problem, which allows us to define
retrieval criteria that can capture dependencies among documents.
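Two of the more mechanical items above, the evaluation measures and the integer codes used for index compression, can be practiced with a short Python sketch. It is illustrative only: the function names and the toy ranking are assumptions, average precision is divided by the total number of relevant documents (the common convention), and the unary code is assumed to write n >= 1 as n-1 one bits followed by a zero. Check these conventions against the lecture notes, since some texts use different ones.

```python
def precision_recall(ranked, relevant, k=None):
    """Precision and recall of the top-k results (the whole list if k is None)."""
    retrieved = ranked if k is None else ranked[:k]
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant doc appears.

    Relevant documents that are never retrieved contribute zero.
    """
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def to_gaps(doc_ids):
    """Replace sorted document IDs by differences between neighbors (d-gaps)."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def unary(n):
    """Unary code for n >= 1 (assumed convention): n-1 ones, then a zero."""
    return "1" * (n - 1) + "0"

def gamma(n):
    """Gamma code for n >= 1: unary(bit length of n), then n without its leading 1 bit."""
    binary = bin(n)[2:]          # e.g. 5 -> "101"
    return unary(len(binary)) + binary[1:]

# Toy ranking (hypothetical): d6 is relevant but never retrieved.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p, r = precision_recall(ranked, relevant)    # 2/5 and 2/3
ap = average_precision(ranked, relevant)     # (1/1 + 2/3) / 3 = 5/9

# Posting list compression: take gaps first, then code each gap.
codes = [gamma(g) for g in to_gaps([3, 7, 8, 15])]   # gaps are 3, 4, 1, 7
```

Gap encoding pays off because small integers get short codes: under the convention assumed here, gamma(1) is the single bit "0", while gamma(5) is "11001".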