REFORM: Robust, EFfective, and Optimal Retrieval Models
Although many different retrieval models have been proposed and studied since the beginning of the field of IR, no single model has proven to be the best. Theoretically well-motivated models all need heuristic modifications to perform well empirically. Developing principled retrieval models that also perform well empirically has been a long-standing scientific challenge.
Existing retrieval models have several fundamental limitations: (1) The performance of a retrieval model is highly sensitive to the document collections and queries in an unpredictable way. (2) A model that performs well on some data set may perform poorly on another data set. (3) Heavy parameter tuning must be done manually to achieve optimal performance.
In this project, we aim to develop novel retrieval models that are robust (with respect to variation in document collections and queries), effective (in terms of retrieval accuracy), and optimal to a certain extent that can be guaranteed.
The following are a few specific research directions that we are currently exploring.
Statistical language models
Statistical language models have recently been applied to information retrieval with considerable success. Because of their solid statistical foundation, they make it possible to tune retrieval parameters automatically through statistical estimation. We are developing new language models that are more robust and effective than existing models.
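As a concrete illustration of the language modeling approach, the sketch below ranks documents by query likelihood under a unigram model with Dirichlet prior smoothing. This is one well-known instance of the approach, not necessarily the models being developed in this project. The function name, the toy documents, and the fixed value of the smoothing parameter mu are all assumptions made for illustration; here mu is set by hand, whereas the appeal of the statistical framework is that such parameters can in principle be estimated from data.

```python
import math
from collections import Counter


def dirichlet_query_likelihood(query, doc, collection_tf, collection_len, mu=2000.0):
    """Log-likelihood of `query` under a Dirichlet-smoothed unigram model of `doc`.

    p(w | d) = (c(w, d) + mu * p(w | C)) / (|d| + mu)
    """
    doc_tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for w in query:
        p_coll = collection_tf.get(w, 0) / collection_len
        if p_coll == 0.0:
            # Assumption: query terms unseen in the whole collection are skipped.
            continue
        score += math.log((doc_tf[w] + mu * p_coll) / (doc_len + mu))
    return score


# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "link analysis for web search".split(),
}
collection_tf = Counter(w for d in docs.values() for w in d)
collection_len = sum(collection_tf.values())

query = "language models".split()
ranking = sorted(
    docs,
    key=lambda name: dirichlet_query_likelihood(
        query, docs[name], collection_tf, collection_len
    ),
    reverse=True,
)
print(ranking)
```

Because both toy documents have the same length, the document that actually contains the query terms receives the higher smoothed probability and is ranked first.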
Axiomatic approaches to information retrieval
Our previous work has shown that intuitive retrieval heuristics can be captured by formally defined constraints on retrieval functions, and that by analyzing these constraints we can predict the empirical behavior of a retrieval method analytically. We are currently extending this work to develop a general axiomatic method for studying and developing retrieval models.
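To make the idea of a constraint on a retrieval function concrete, the sketch below checks a simplified term-frequency constraint: with document length held fixed, a document containing more occurrences of a query term should receive a strictly higher score. It is offered in the spirit of the heuristics formalized in the publications listed on this page, not as their exact definitions. The three scoring functions and the checker are hypothetical, and the check is a finite numerical test rather than an analytical proof.

```python
import math


def raw_tf(tf, doc_len):
    """Score grows linearly with term frequency."""
    return float(tf)


def log_tf(tf, doc_len):
    """Score grows sublinearly, but still strictly, with term frequency."""
    return math.log(1.0 + tf)


def capped_tf(tf, doc_len):
    """Score saturates at tf = 3, so extra occurrences stop helping."""
    return float(min(tf, 3))


def satisfies_tf_constraint(score, max_tf=10, doc_len=100):
    """Check, for tf = 0..max_tf-1 at a fixed document length,
    that one more occurrence of the query term strictly raises the score."""
    return all(
        score(tf + 1, doc_len) > score(tf, doc_len) for tf in range(max_tf)
    )


results = {
    f.__name__: satisfies_tf_constraint(f) for f in (raw_tf, log_tf, capped_tf)
}
print(results)
```

A function such as `capped_tf` fails the check because its score stops increasing once the cap is reached. Diagnosing this kind of behavior without running retrieval experiments is what the constraint analysis is meant to enable.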
Hypertext retrieval models
A major challenge in developing models for hypertext retrieval is to combine content information effectively with the link structure available in hypertext collections. Although several link-based ranking methods have been developed to improve retrieval results, none of them fully exploits both the discriminative power of content and all useful link structures. We are currently working on a general relevance propagation framework for combining content and link
- Hui Fang, ChengXiang Zhai, An Exploration of Axiomatic Approach to Information Retrieval, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'05), pages 480-487, 2005.
- Hui Fang, Tao Tao, ChengXiang Zhai, A Formal Study of Information Retrieval Heuristics, Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'04), pages 49-56, 2004. Best Paper Award.
- Tao Tao, ChengXiang Zhai, A Two-stage Mixture Model for Pseudo Feedback, Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'04), pages 486-487, 2004. (poster)
- Tao Tao, ChengXiang Zhai, A Mixture Clustering Model for Pseudo Feedback in Information Retrieval, Proceedings of the 2004 Meeting of the International Federation of Classification Societies (IFCS'04). Invited Paper.
- ChengXiang Zhai, Tao Tao, Hui Fang, Zhidi Shang, Improving the Robustness of Language Models - UIUC TREC 2003 Robust and Genomics Experiments, Proceedings of 2003 Text REtrieval Conference (TREC2003), 2004.
- Azadeh Shakery, ChengXiang Zhai, Relevance Propagation for Topic Distillation UIUC TREC 2003 Web Track Experiments, Proceedings of 2003 Text REtrieval Conference (TREC2003), 2004.