CS397-CXZ Introduction to Text Information Systems (Fall 2003)

| Home | Basic Information | Schedule |
| Readings | Assignments | Project | Resources |




Readings


Because the course covers a wide range of topics, it is very difficult to find an appropriate textbook, and many additional readings have to be used. However, in many cases, you are not required to read or to fully understand the whole content of a paper or book chapter. This page is intended to provide some information about where the key contents are and which part(s) to focus on when reading a paper/book.

Core contents

In general, the lecture slides are the best "definition" of the core contents -- the contents to be tested. That is, you are expected to understand all the major points and algorithms that we have discussed in the class; anything beyond the slides can be regarded as optional. The last slide of each lecture usually summarizes what you should know for that lecture. After each class, you should check the last slide to make sure that you indeed understand all the major points and any necessary technical details. Since some material we cover in the lecture can not be readily found in any of the reading materials, you should also make every effort to come to each class. Come to the office hours if you have any questions about any content.

Textbooks

Other readings

Part I: Understanding text -- Natural Language Processing

  1. V. Bush, As we may think, 1945.

    Goal: to know about this classic paper and appreciate Bush's great vision which has NOT yet completely realized. Minimum reading: Everything starting from section 6. Required level of understanding: deep understanding is NOT required

  2. S. Abney. Part-of-speech tagging and partial parsing. In S. Young and G. Bloothooft, editors, Corpus-Based Methods in Language and Speech Processing, pages 118--136. Kluwer Academic Publishers, Dordrecht, 1997 (citeseer)

    Goal: to know more about the state of the art of POS tagging and shallow parsing techniques. Minimum reading: None; read whatever you can understand without worrying too much about the details. Required level of understanding: Deep understanding is NOT required

  3. E. Charniak. Statistical techniques for natural language parsing. AI magazine, 18(4):33- 43, 1997. (citeseer)

    Goal: to know more about the state of the art of POS tagging and shallow parsing techniques. Minimum reading: None; read whatever you can understand without worrying too much about the details. Required level of understanding: Deep understanding is NOT required

  4. Rosenfeld's notes (estimation information theory)

    Goal: to know about some basic concepts in probability, statistics, and information theory. Minimum reading: Section 3 of the estimation note; All of the information theory note except for section 1.1.6. Required level of understanding: You should fully understand the derivation of the maximum likelihood estimate for the binomial distribution, and most of the contents in the information theory notes. If you can't understand these, you may want to consult a textbook on probability and statistics, and a book on information theory. Any book on these topics should be sufficient.

  5. R. Rosenfeld, "Two decades of statistical language modeling: Where do we go from here?," Proceedings of the IEEE, vol. 88, pp. 1270-- 1278, 2000. (citeseer)

    Goal: to know about the overall state of the art of statistical language models. Minimum reading: Your should try to read the whole paper, but don't worry about some of the details that you can't understand. Required level of understanding: It's fine to skip some details.

  6. S. Chen and J. Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Meeting of the Association for Computational Linguistics. ACL, 1996. (citeseer)

    Goal: to know about different smoothing methods. Minimum reading: None; this is an optional reading, but the paper has a nice exposition of different smoothing methods. Required level of understanding: Focusing on the smoothing formulas if interested.

  7. Rosenfeld's note on EM.

    Goal: to know about basic idea behind EM and how it can be used to estimate a mixture language model. Minimum reading: Read up to Section 4; skip the HMM part. Required level of understanding: You should try to fully understand what you read.

  8. Bilmes, J.: A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, University of Berkeley (1998) ( citeseer)

    Goal: to know about the thorough derivation of EM Minimum reading: None; this is an optional reading. It's an excellent rigorous explanation of EM. Read up to Section 3; skip the HMM part. Required level of understanding: Read it only if you are seriously interested in knowing all the details about EM. Note: This is an excellent tutorial on EM.

  9. Minka's note on EM

    Goal: to know about EM from the optimization viewpoint. Minimum reading: None; this is an optional reading. Required level of understanding: Read it only if you are seriously interested in knowing all the details about EM. Note: This is an excellent explanation of EM from the viewpoint of opitimization.

Part II: Accessing Text -- Text Retrieval & Filtering

  1. A. Singhal, Modern Information Retrieval: A Brief Overview, In IEEE Data Engineering Bulletin 24(4), pages 35-43, 2001. (ps)

    Goal: to know about the general history of IR and a summary of IR techniques from empirical perspective. Minimum reading Read the whole paper. Required level of understanding: Focus on the text part; we'll discuss those formulas in class.

  2. Explanation of TREC measures ( pdf )

    Goal: to know how to compute basic retrieval measures. Minimum reading: Read at least Section 1 and Section 2. Required level of understanding: You should understand everything in Section 1 and Section 2

  3. G. Salton and C. Buckley, Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513--523, 1998.

    Goal: to know some possible alternative TF-IDF weighting formulas Minimum reading: Read the whole paper Required level of understanding: You should understand everything in this paper. This paper summarizes the "pre-TREC" effective retrieval formulas. Since TREC started in early 90's, other improved formulas have been developed -- pivoted normalization and Okapi are the two main formulas. Recently, language models are getting more popular.

  4. A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In Proceedings of the 19th ACM Conference on Research and Development in Information Retrieval (SIGIR'96), pages 21--29, 1996. (citeseer page)

    Goal: to know how the pivoted length normalization method is developed. Minimum reading: Read as much as you can. Required level of understanding: You should the motivation for the pivoted length normalization method.

  5. John Lafferty and Chengxiang Zhai, Probabilistic relevance models based on document and query generation, In Language Modeling and Information Retrieval, Kluwer International Series on Information Retrieval, Vol. 13, 2003. (ps)

    Goal: to know about the foundation of probabilistic retrieval models, including language models. Minimum reading: Read the whole paper. Required level of understanding: You should try to understand everything in this paper.

  6. K. Sparck-Jones, S. Walker and S.E. Robertson, A probabilistic model of information retrieval: development and comparative experiments. Information Processing &Management 36(6), pp. 779-840, 2000.(citeseer page)

    Goal: to know about the Robertson Sparck-Jones model, the BM25 formula and the Okapi system. Minimum reading: Read section 2 (Foundations) and Section 4 (Data), but this is an optional reading. Required level of understanding: If you do read it, you should try to understand everything in Section 2.

  7. N. Fuhr "Probabilistic models of information retrieval," Computer Journal, 35, 3, pp. 244--255, 1992. (citeseer page)

    Goal: to know about traditional probabilistic retrieval models. Minimum reading: None. This is an optional reading. Required level of understanding: N/A.

  8. W. Cooper, F. Gey, and D, Dabney, Probabilistic retrieval based on staged logistic regression, Proceedings of SIGIR 1992. ( ACM DL page)

    Goal: to know about logistic regression for retrieval. Minimum reading: None. This is an optional reading. Required level of understanding: N/A.

  9. Excerpts from Managing Gigabytes (Chapter 3 & Chapter 5)
  10. Goal: to know about how to build an inverted index. Minimum reading: Focus on the compression methods for integers and the sorting based method for building an inverted index. Required level of understanding: Know how gamma-coding works and know how to build an inverted index using the sorting-based method.

  11. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR'2001. (citeseer page)

    Goal: to know how query-likelihood retrieval method works and how smoothing is related to TF-IDF weighting. Minimum reading: Read the whole paper. Required level of understanding: You should understand everything in the first three sections.

  12. J. Ponte and W.B. Croft. A Language Modeling Approach to Information Retrieval. In SIGIR 98, pages 275--281, 1998. ( citeseer page )

    Goal: to know how query-likelihood retrieval method works and how smoothing is related to TF-IDF weighting. Minimum reading: Focus on Section 3, but this is an optional reading. Required level of understanding: N/A.

  13. C. Zhai and J. Lafferty, Model-based feedback in the Language Modeling approach to information retrieval. In Proceedings of CIKM 2001. (citeseer page)

    Goal: to know the KL-divergence scoring formula and how a mixture model can be used to do feedback. Minimum reading: Read the whole paper. Required level of understanding: Understand how the mixture model works.

  14. C. Zhai and J. Lafferty, A risk minimization framework for information retrieval, 2003. (ps)

    Goal: to know about the basic idea of the risk minimization framework. Minimum reading: Read sections 1, 2, and 3. Required level of understanding: Try to understand why the risk minimization framework is more general than a vector space model or a probabilistic retrieval model.

  15. Y. Zhang and J. Callan, A generative model for filtering threshold, ARDA Workshop on Language modeling and information retrieval, 2001, Carnegie Mellon University. (pdf)

    Goal: to understand the basic idea of using score distributions set filtering threshold. Minimum reading: Read section 1 and section 2. Required level of understanding: Try to understand how scores of relevant and nonrelevant documents are modeled differently and how to set the filtering threshold based on estimated score distributions.

  16. John S. Breese, David Heckerman, Carl Kadie, Empirical Analysis of Predictive Algorithms for Collaborative Filtering (1998) (citeseer)

    Goal: to know about some representative algorithms for collaborative filtering. Minimum reading: Read the whole paper. Required level of understanding: Focus on understanding the correlation coefficient method.

Part III: Organizing Text -- Text categorization & clustering

  1. Sebastiani, F. Machine Learning in Automated Text Categorisation. Technical Reeport B4-31, Istimto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, 1999. (citeseer)

    Goal: to understand the state of the art of text categorization. Minimum reading: Read the whole paper Required level of understanding: Read through the paper without worrying about not following some details. (The Naive Bayes classifier is our focus.)

  2. M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.(citeseer)

    Goal: to get some sense about which method might work for document clustering. Minimum reading: Read the whole paper Required level of understanding: You should be able to understand everything in this paper, but don't worry about any detail that you can't follow. (The mixture model approach is our focus for clustering methods.)

Part IV: Acquiring knowledge from Text -- Information Extraction & Text Mining

  1. J. Bilmes, A gental tutorial of the EM algorithm and its application to parameter estimation for Gaussian Mixture and Hidden Markov Models., UC Berkeley TR-97-021, April, 1998. (pdf)

    Goal: to understand the derivation of Baum-Welch algorithm using EM Minimum reading: None Required level of understanding: This is a completely optional reading. Read it only if you want to know about EM and Baum-Welch algorithm in depth.

  2. L. Rabiner, Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, In Proceedings of IEEE, 77(2). Feb. 1989 (pdf)

    Goal: to understand the basics of HMM Minimum reading: Read Sections I, II, and III. Required level of understanding: You should try to understand everything in Sections I and II; read section III as much as you can.

    The following readings [3-7] are ALL OPTIONAL. Read them if you want to know more about the topic.

  3. Chengziang Zhai. Exploiting context to identify lexical atoms - a statistical view of linguistic context. In proc. of International & Interdisciplinary Conference on Modifying and Using Context (CONTEXT-97), pages 119-- 129, Rio de Janeiro, Brazil, 1997 ( Citeseer URL)
  4. Freitag D., McCallum A.K. (2000). Information extraction with HMM structures Learned by Stochastic Optimization, AAAI-2000, pp. 584-589. (Citeseer URL)
  5. Hai Leong Chieu, Hwee Tou Ng, A Maximum Entropy approach to information extraction from semi-structured and free text, ACL 2002. (pdf )
  6. Dan Roth, Wen-tou Yih, Relational Learning via Propositional Algorithms: An Information Extraction Case Study, IJCAI 2001 (pdf)
  7. M. Hearst, User Interface and information visualization, Chapter 10 in "Modern Information Retrieval". (online version)

Part V: Applications -- Web information management

  1. Raymond Kosala and Hendrik Blockeel, Web mining research: A survey, SIGKDD Explorations, 2(1), pages 1 - 15, 2000. ( Citeseer URL)

    Goal: to get an overview of web mining Minimum reading: Read the entire paper. Required level of understanding: Try to understand the overall picture; ignore the details.

  2. Gisle Hannemyr, SEARCH ENGINE SURVEY An overview of the mapmakers of cyberspace, online document on the Web ( URL)

    Goal: to get an overview of Web search engines Minimum reading: Read the entire paper. Required level of understanding: Try to understand the overall picture; ignore the details.

  3. Berkeley Teaching Lab (URL)

    Goal: to get an sense about what some real Web search engines can do Minimum reading: Read the entire page at the URL above. Required level of understanding: Try to know the major features the existing search engines support.