CS410 Text Information Systems (Spring 2008)

Instructor: ChengXiang Zhai





Readings


Because the course covers a wide range of topics, it is very difficult to find an appropriate textbook, and some additional readings have to be used. However, in many cases, you are not required to read or to fully understand the whole content of a paper or book chapter. This page is intended to provide some information about where the key contents are and which part(s) to focus on when reading a paper/book.

Core contents

In general, the lecture slides are the best "definition" of the core contents -- the contents to be tested. That is, you are expected to understand all the major points and algorithms that we have discussed in class; anything beyond the slides can be regarded as optional. The last slide of each lecture usually summarizes what you should know for that lecture. You should check the last slide to make sure that you indeed understand all the major points and any necessary technical details. Since some material we cover in lecture cannot be readily found in any of the reading materials, you should make every effort to come to each class. Come to office hours if you have any questions about any content.
  1. V. Bush, As we may think, 1945.

    This is truly a classic paper. Read it to appreciate Bush's great vision, which has NOT yet been completely realized. At a minimum, read everything starting from section 6.

  2. Rosenfeld's notes (estimation and information theory)

    The goal of reading these notes is to know about some basic concepts in probability, statistics, and information theory. You should read at least Section 3 of the estimation note and all of the information theory note except for section 1.1.6. You should fully understand the derivation of the maximum likelihood estimate for the binomial distribution, and most of the contents in the information theory notes. If you can't understand these, you may want to consult a textbook on probability and statistics, and a book on information theory. Any book on these topics should be sufficient.

  3. R. Rosenfeld, "Two decades of statistical language modeling: Where do we go from here?," Proceedings of the IEEE, vol. 88, pp. 1270-1278, 2000. (pdf)

    The goal is to know about the overall state of the art of statistical language models. You should try to read the whole paper, but don't worry about details that you can't understand; it's fine to skip them.

  4. A. Singhal, Modern Information Retrieval: A Brief Overview, In IEEE Data Engineering Bulletin 24(4), pages 35-43, 2001. (pdf)

    This is a very good overview paper of IR, though it's a bit out of date and slightly biased toward empirically effective techniques. Your goal in reading it is to learn the general history of IR and get a summary of IR techniques from an empirical perspective. Read the whole paper.

  5. Explanation of TREC measures (pdf)

    Read at least Section 1 and Section 2 to know how to compute basic retrieval measures.

  6. Book Chapter 8

    Read 8.3 and 8.4. Other sections should also be very interesting to read, though not required.

  7. Review of IR models

    Read this entire review to get a good picture of all the retrieval models.

  8. Book Chapter 6 and Book Chapter 7

    Read all of Chapter 6 and Section 7.1.

  9. Book-Ch9

    Read 9.1.1. The rest of the chapter should also be very interesting to read if you want.

  10. Zobel & Moffat 98

    Optional reading. This is a nice evaluation of different weighting methods. Read it if you want to know about many variations of TF-IDF weighting and which variant is relatively more effective.

  11. Book-Ch1-5

    All optional. Read whatever you feel is useful to you.

  12. Book-Ch11-12

    Optional reading. Chapter 11 has a good introduction to "classic probabilistic models", which we didn't cover in detail. Okapi was derived from this family of models with lots of heuristic modifications. Chapter 12 covers the language modeling approach but not in-depth and may be hard to follow.

  13. C. Zhai and J. Lafferty, A study of smoothing methods for language models applied to information retrieval, ACM TOIS, 2004. (pdf)

    Read up to section 9.1. That is, skip 9.2 and everything after it. Focus on understanding the basic idea of the query likelihood scoring method, the Dirichlet prior smoothing method, and the two-stage smoothing method.

  14. C. Zhai and J. Lafferty, Model-based feedback in the Language Modeling approach to information retrieval. In Proceedings of CIKM 2001. (pdf)

    The goal is to know the KL-divergence scoring formula and how a mixture model can be used to do feedback. Read the whole paper and try to understand how the mixture model works. Ignore the divergence minimization method.

  15. Note on KL-div Retrieval Model

    Read the entire note.

  16. Note on EM

    Optional. Read it if you really want to understand the EM algorithm rigorously.

  17. L. Page and others, The PageRank Citation Ranking: Bringing Order to the Web (1998) (CiteSeer)

    Read the whole paper. This is a classic paper about Google's PageRank algorithm. Your main goal is to understand the basics of the PageRank algorithm.

  18. Austin's note on computing PageRank

    This article explains clearly how to use the Power Method to compute PageRank.

  19. John S. Breese, David Heckerman, Carl Kadie, Empirical Analysis of Predictive Algorithms for Collaborative Filtering (1998) (url)

    Read Section 1, Section 2.1-2.2. The goal is to know how memory-based algorithms work.

  20. C. Zhai and others, Threshold Calibration in CLARIT Adaptive Filtering, Proceedings of TREC 1998.

    The main goal is to understand Section 3. You may want to read some other parts, especially Sections 1 and 2, to get some background.
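
As a companion to readings 17 and 18, here is a minimal sketch of computing PageRank by power iteration. It is not taken from either reading: the function name `pagerank`, the damping factor of 0.85, the uniform redistribution of rank from pages with no out-links, and the toy graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    damping, tol, and max_iter are illustrative defaults (assumptions).
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(max_iter):
        # Rank held by pages with no out-links is spread evenly (an assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                # Each page passes its rank equally to the pages it links to.
                new_rank[q] += damping * rank[p] / len(outs)
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # stop once the scores have stabilized
            break
    return rank


# Hypothetical three-page graph: A links to B and C, B to C, C to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph, C receives links from both A and B, so it ends up with the highest score, while B, which is linked only from A, ends up with the lowest.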

More to be added later.