Personalized Search

                   It can also be viewed from ucair.wordpress.com.  Comments or suggestion? Please send them to

 Cryptography and Privacy preservation in personalization [tag: zero-knowledge proof, privacy]

March 18, 2007

Avi Wigderson gave three lectures in Princeton's public lecture series. The three talks covered computation/computability, computational complexity, and cryptography. In the cryptography lecture, he talked about zero-knowledge proofs, private communication, and oblivious communication.

I hope that these techniques can be applied to privacy-preserving personalized search. In the wishful scenario of privacy-preserving personalized search in my SIGIR Forum paper (Level IV, no personal information), the search engine returns relevant results after the user submits a query, yet it does not learn which query terms the user submitted.

P.S. Google changed the privacy policy for its search engine logs last week. Google will remove the last 8 bits of the 32-bit IP address associated with each query after storing the logs for 18 to 24 months.
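To make "removing the last 8 bits" concrete, here is a minimal sketch in Python. It only illustrates the bit masking; how Google actually implements the anonymization is not stated anywhere I know of, and the function name and the sample addresses are my own.

```python
import ipaddress

def anonymize_ipv4(address: str) -> str:
    """Zero the last 8 bits (the final octet) of a 32-bit IPv4 address."""
    ip = ipaddress.IPv4Address(address)
    masked = int(ip) & 0xFFFFFF00  # keep the first 24 bits, clear the last 8
    return str(ipaddress.IPv4Address(masked))

# Sample address from the documentation range, not from any real log.
print(anonymize_ipv4("192.0.2.57"))  # 192.0.2.0
```

After masking, up to 256 addresses collapse into one, so a logged query can only be traced back to a block of addresses rather than a single one.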

 How much a Search Engine company can make for each search

March 17, 2007

Recently, Yahoo! began to use its new ad system, Panama, hoping to narrow the gap in money-making power between Google and Yahoo!. From a BusinessWeek article of December 26, 2006, I learned that Tim Boyd, a financial analyst at Caris & Co., estimated that Google makes 20 cents per search while Yahoo! makes 10 cents per search. During a visit, I mentioned this number to a friend. My friend said he had seen a different number and that numbers from financial analysts should sometimes be double-checked. I agree with my friend. Moreover, Jon Bentley suggests using "back-of-the-envelope" calculations, standard fare in engineering schools.  Here is my "back-of-the-envelope" calculation of Google's money-making power.

In Q3 2006, Google's total revenue was $2.690 billion according to the Google income statement. According to Nielsen//NetRatings data, Google received 2.776 billion queries (49% U.S. search share) in July 2006, 3.003 billion queries (50%) in August 2006, and 2.826 billion queries (50%) in September 2006. Thus 8.605 billion queries were submitted to Google in Q3 2006. If we assume that all of Google's revenue comes from ads (AdWords or AdSense), then on average Google makes $2.690 billion / 8.605 billion queries = $0.31/query, i.e., 31 cents per query.

In Q4 2006, Google's total revenue was $3.205 billion. According to Nielsen//NetRatings data, Google received 3.022 billion queries (50%) in October 2006, 3.098 billion queries (50%) in November 2006, and 3.036 billion queries (51%) in December 2006. Thus 9.156 billion queries were submitted to Google in Q4 2006. On average Google makes $3.205 billion / 9.156 billion queries = $0.35/query, i.e., 35 cents per query.

From this simple calculation for Q3 2006 and Q4 2006, we can see that Google makes roughly 30 to 35 cents per query on average. Since Yahoo!'s revenue comes from diverse sources, it is difficult to compute the corresponding Yahoo! number from the Nielsen//NetRatings data and the financial reports.
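The arithmetic is easy to recheck with a few lines of Python. This is just the same back-of-the-envelope calculation, using the quarterly revenue and monthly query counts quoted above; the function name is mine, and the assumption that all revenue is ad revenue driven by these queries still applies.

```python
def revenue_per_query(revenue_billion, monthly_queries_billion):
    """Average revenue per query: quarterly revenue / total quarterly queries."""
    return revenue_billion / sum(monthly_queries_billion)

# Figures as quoted in this post (billions of dollars, billions of queries).
q3 = revenue_per_query(2.690, [2.776, 3.003, 2.826])
q4 = revenue_per_query(3.205, [3.022, 3.098, 3.036])

print(f"Q3 2006: ${q3:.2f} per query")  # $0.31
print(f"Q4 2006: ${q4:.2f} per query")  # $0.35
```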

New Google Personalized Search

February 22, 2007

Recently, Google has been pushing personalized search. It now offers a personalized homepage, search history, and personalized search results. I tried the personalized search, and it is not clear whether a specific query is personalized or not. I think this is one aspect that Google can improve, i.e., informing each user when personalization happens and which results are personalized. About this, Marissa Mayer said in an interview

One thing that we've struggled with is if we should actually mark the results are entering the page as a result of personalization but because team is currently and frequently doing experiments, we didn't want to settle on a particular model or marker at this exact moment.

Marissa Mayer, VP of Google, said in the interview

The actual implementation of personalized search is that as many as two pages of content, that are personalized to you, could be lifted onto the first page and I believe they never displace the first result, because that's a level of relevance that we feel comfortable with. So right now, at least eight of the results on your first page will be generic, vanilla Google results for that query and only up to two of them will be results from the personalized algorithm. I think the other thing to remember is, even when personalization happens and lifts those two results onto the page, for most users it happens one out of every five times.

I like the idea of combining personalized search results and generic search results. In my thesis, I proposed progressive personalization. When the search engine is not confident about the user's intention, it can present generic results and must at least not annoy the user by pushing unrelated personalized results; when the search engine is confident about the user's intention, it can push personalized results to the user.
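Here is a small sketch of how such a conservative blend could look. It follows the numbers in the interview (a first page of ten, at most two personalized results, the first generic result never displaced), but the confidence flag, the function name, and the choice to place lifted results right after the first result are my own assumptions, not Google's implementation.

```python
def blend_first_page(generic, personalized, confident,
                     max_personalized=2, page_size=10):
    """Build the first result page from generic and personalized rankings."""
    first_page = generic[:page_size]
    # Progressive personalization: fall back to generic results
    # whenever we are not confident about the user's intention.
    if not confident or not first_page:
        return first_page
    # Lift only results that are not already on the generic first page.
    lifted = [r for r in personalized if r not in first_page][:max_personalized]
    # The top generic result stays in place (assumed placement: lifted
    # results directly below it), then the remaining generic results.
    return ([first_page[0]] + lifted + first_page[1:])[:page_size]

generic = [f"g{i}" for i in range(1, 13)]
personalized = ["p1", "g2", "p2", "p3"]
print(blend_first_page(generic, personalized, confident=True))
print(blend_first_page(generic, personalized, confident=False))
```

With these toy lists, the confident case yields eight generic and two personalized results, and the unconfident case returns the plain generic page.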

In summary, Google is pushing personalized search in a conservative way.

Google and Kaltix [tag: Kaltix]

February 21, 2007

Besides Outride, Google acquired Kaltix in September 2003.  Here is the press release from Google and an article from CNET about Kaltix in August 2003. Kaltix had three founders, who may be Taher Haveliwala, Sepandar Kamvar, and Glen Jeh. They co-authored a paper giving an analytical comparison of approaches to personalized PageRank.

Initially, each of them had a first-author publication related to personalized PageRank.

Haveliwala: Topic-Sensitive PageRank, WWW02;
Jeh: Scaling Personalized Web Search, WWW03;
Kamvar: Extrapolation Methods for Accelerating PageRank Computations, WWW03.

Recently, Professor Junghoo Cho from UCLA published related work: Automatic Identification of User Interest for Personalized Search, WWW06. This work incorporates implicit feedback into PageRank.

Google and OutRide [tag: Outride]

February 20, 2007

Recently, Google introduced more personalization technology on its website, which I will review later. But back in September 2001, Google had already acquired Outride, a startup doing personalized search. Outride was a spinoff of Xerox PARC (just recently, Xerox PARC made a deal with the search engine startup Powerset to do natural language search).

Outride was founded by Jim Pitkow, Hinrich Schütze, and Todd Cass. It is one of the earliest systems doing personalized search. The most relevant publication about Outride is an article in Communications of the ACM. From the article, I can see that Outride also does personalization on the client side and uses query augmentation and result reranking techniques. It appears that they implemented a web browser plug-in (a sidebar), similar to a toolbar. The paper does not reveal many technical details.

UCAIR emphasizes eager feedback, i.e., whenever the user interacts with the retrieval system, such as by selecting a web page, the system can respond, e.g., by updating the user model. UCAIR is based on a decision-theoretic framework and context-sensitive statistical language models.

Haveliwala's Topic-Sensitive PageRank [tag: PageRank, Topic-Sensitive Retrieval]

February 18, 2007

I reviewed Haveliwala's Topic-Sensitive PageRank paper, which won the best student paper award at WWW 2002. This work is one of the early research efforts in personalized search based on the PageRank algorithm. I think it is really solid work. The author used the Stanford WebBase crawler to crawl part of the Web, and used ODP to build a personalization vector and a probability distribution of query words for each topic.  The author used overlap rate and a variant of Kendall distance as evaluation metrics. Besides that, the author conducted a user study to evaluate the performance of topic-sensitive PageRank. In the end, the author also mentioned some potentially interesting problems and directions in personalized search, such as privacy and the discovery of query context.

The idea is to compute a list of PageRank scores (instead of a single PageRank score) for each web page, i.e., for each topic, there is a PageRank score for each web page. These topic-sensitive PageRank scores can be computed from the web graph and the topic classification of web pages in the ODP data. Then, for each user query, the search engine computes the probability distribution of topics for the query and takes a weighted sum of the page's topic-specific PageRank scores (the weight of each topic is its probability given the query) as the final rank score. To get the probability distribution of topics for a query, the search engine can use the query words directly. It can also compute the distribution from the query and its context. The author conducted a user study (5 users, each with 10 queries).
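The query-time combination can be sketched in a few lines. As I read the paper, each topic-specific PageRank score is weighted by the probability of that topic given the query; the topic names, URLs, and numbers below are made up for illustration, and computing the per-topic PageRank vectors themselves is not shown.

```python
def topic_sensitive_score(topic_probs, page_topic_ranks):
    """Weighted sum of a page's topic-specific PageRank scores.

    topic_probs: P(topic | query) for each topic.
    page_topic_ranks: the page's PageRank score under each topic.
    """
    return sum(prob * page_topic_ranks.get(topic, 0.0)
               for topic, prob in topic_probs.items())

def rank_pages(topic_probs, pages):
    """Order page URLs by their query-sensitive score, best first."""
    return sorted(pages,
                  key=lambda url: topic_sensitive_score(topic_probs, pages[url]),
                  reverse=True)

# Hypothetical topic distribution for a query and per-topic PageRank scores.
query_topics = {"Computers": 0.8, "Recreation": 0.2}
pages = {
    "java-tutorial.example": {"Computers": 0.004, "Recreation": 0.001},
    "java-travel.example": {"Computers": 0.001, "Recreation": 0.005},
}
print(rank_pages(query_topics, pages))
```

If the topic distribution leans toward Recreation instead, the same precomputed scores rank the travel page first, which is the point of keeping one PageRank vector per topic.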

This work is done on the server side and can be applied directly to a search engine. But it cannot be applied directly on the client side, since a client-side search agent does not have the web graph. The topic selection is coarse-grained since it uses only the top-level ODP topic categories. For each individual person, we can also have a topic category.

Two talks about Search Security [tag: Privacy]

January 28, 2007

There are two talks related to search or search personalization.

One is a talk about search privacy, by Dr. Lorrie Faith Cranor, a professor at CMU.

The other is a talk about Secure Personalization: Towards Trustworthy Recommender Systems, by Dr. Bamshad Mobasher, a professor at DePaul.

Sunset of Findory [tag: Findory, Personalized News]

January 27, 2007

Today, I learned that Findory, a personalized news website, “rides into the sunset”.  It is sad news. But I believe that personalization technology will succeed somewhere in real-world applications.

A Talk about Privacy-Enhanced Personalization [tag: Privacy-Enhancing]

January 26, 2007

I found that there is a talk by Dr. Alfred Kobsa, a professor at UCI. The title of the talk is Privacy-Enhanced Personalization. It should be very relevant to my thesis research on privacy-preserving personalized search.

Susan Dumais' Personalized Search Talk at Yahoo! Research [tag: personalized search, Stuff I've Seen, implicit query, Phlat]

January 07, 2007

From Greg Linden's blog, I got to know Susan gave a personalized search talk at Yahoo! Research. Video of the talk is available at Yahoo! Video. Susan will also come to the town on Mar 26, 2007 and give a talk on Information Retrieval in Context. 

Update on January 8, 2007: Susan Dumais was named an ACM Fellow for her work in information retrieval and human-computer interaction. In recent years, Susan has done a lot of research on personalized search and has had several influential projects such as Stuff I've Seen, Implicit Query, and Phlat. In the Class of 2006, there are three fellows doing information retrieval research. Besides Susan, there are C. Lee Giles (CiteSeer) and Peter Norvig (Google). There are also quite a few fellows doing database and data mining research.

ACM Recommendation Policy on Privacy [tag: privacy, search engine log]

January 05, 2007

In June 2006, USACM published a privacy policy recommendation on the ACM website (visit http://www.acm.org/usacm/Issues/Privacy.htm for the content). To strike a balance between individual privacy protection and valid governmental and commercial usage, ACM recommends minimization, consent, openness, access, accuracy, security, and accountability.

In August 2006, the AOL search log incident occurred. The search engine has now become an indispensable tool in people's daily lives. However, many people may not be aware that search engines store a lot of personal information, which can potentially reveal a gamut of details about individuals' private lives, such as medical history and hobbies. I think that, measured against the ACM recommendations, search engine companies have a long way to go. For example, people currently have virtually no access to search engine logs, although no personal identity is stored on the search engine side. Moreover, search engine logs are probably stored on search engine data servers indefinitely.

Some search engines, such as Google Personalized, have implemented personalized search functionality and provide some interfaces for users to modify these data. For example, Google lets users delete search history entries one by one. But this is still not convenient for users. For example, users cannot remove several entries in a batch.

Some Statistics Related with Web Search [tag: statistics, search engine, query, monetization]

December 26, 2006

Number of Indexed Web Pages

A couple of years ago, search engines competed on how many web pages they indexed. They continuously put larger and larger numbers on their home pages, and sometimes a party that posted a smaller number argued that the other parties overestimated their numbers or used different methods of calculation. Recently, Google removed this number from its home page. Later, Yahoo! and MSN followed. It seems that the number of indexed web pages is not so interesting any more. Many people estimate that there are tens of billions of web pages on the "surface web" and far more hidden web pages from searchable databases in the "deep web".

Number of Queries

Instead of waging a war over the number of indexed web pages, search engines currently compete on how many queries users submit to them, which is directly related to a company's revenue. Some Internet media research companies report these numbers. The most frequently quoted numbers are from Nielsen//NetRatings and comScore. Nielsen//NetRatings publishes a monthly report on the query shares of search engines. In November 2006, an estimated 6.2 billion queries were conducted at U.S. search engines. Google is at the top with 3.1 billion queries (49.5% share). It is followed by Yahoo! (24.3%), MSN (8.2%), AOL (6.2%), and Ask (2.6%). Monthly estimates of U.S. search engine queries in the second half of 2006 by Nielsen//NetRatings are as follows.

October: 6.0 billion; September: 5.6 billion; August: 6.0 billion; July: 5.6 billion; June: 5.4 billion; June: 5.7 billion.

Google consistently takes 50% share of U.S. search queries. Yahoo! is around 25%~30% and MSN is around 8%~10%.  

Monetization of Queries

Although the number of search queries is directly related to the money that search engine companies can make, it is not proportional to ad revenue. Ad revenue also depends on the advertisement auction and placement system.  For example, although Yahoo!'s search share is about 1/2 of Google's share, Yahoo!'s ad revenue is only about 1/4 of Google's ad revenue. According to the estimate of Caris & Co. analyst Tim Boyd,

"Yahoo made on average between 10¢ and 11¢ per search in 2006, bringing in a total of $1.61 billion for the first nine months of the year. Google, meanwhile, makes between 19¢ and 21¢ per search. As a result, it made an estimated $4.99 billion during the same period."    (Quoted from an article of BusinessWeek

We can see that, on average, each submitted query makes about 20 cents for Google and only about 10 cents for Yahoo!.

Collarity, a Startup of Personalized Search [tag: startup, personalization, industry]

November 26, 2006

From the Venture Beat blog, I learned of another startup doing personalized search, Collarity.  There are already quite a few startups doing personalized search, such as Surf Canyon.

I tried this version after registering an account. There is a slider called the relevance compass, which lets individual users continuously tune the search results from the fully personalized level, through the community level, to the whole-population level. This is the same approach that Microsoft researcher Susan Dumais took to personalized search with her former intern Jaime Teevan.  After trying some queries and clicking some results, I could not see how my search results got personalized. Maybe the company is still at an early stage. Different suggested terms appear at the bottom of the compass when users move the slider. But it is slow, and the user has to select a suggested term to add it to the query. Here is a paragraph from Venture Beat about Collarity.

"Levy Cohen, chief executive of Palo Alto-based Collarity, said he got his idea to launch Collarity because it bothered him that Google returns the exact same results to people even if they have different interests. If you’ve searched for information on Linux before, then the search engine should return results relevant to open source, he said. Moreover, if you search for “Java,” the search engine should know whether you’re more likely interested in the computer language, or coffee."

-Venture Beat

Collarity claims to use the search results of people "like you" to personalize the search results. However, I imagine that by using the interests of other, similar users we can at most get community/group-level personalization. If we really want personal-level personalization, we should use the user's own profile. Collarity's idea is collaborative filtering, which is used extensively in recommendation systems such as those at Amazon.com and Netflix. But most personalization research in academia focuses on exploiting the user's own profile. On the other hand, we may combine these two ideas (i.e., item-based and user-based).

One comment mentions that Collarity is similar to a Yahoo! Research demo, Mindset. I find that the Mindset interface also uses a slider, in its case to vary the results from shopping to research.

Andrei Broder's Information Supply Talk [tag: personalization, privacy, implicit query]

November 25, 2006

From the blog Geeking with Greg, I learned of the talk by Andrei Broder on Information Supply. The slides are available here

In the slides, Andrei Broder expresses his opinion about next-generation web search. In his mind, Information Supply should be the next step after Information Retrieval. He mentions that a search engine can infer the user's information need and provide relevant information even without an explicit query from the user. Actually, some research by Susan Dumais and Mary Czerwinski on Implicit Query is in this direction.

I think Andrei Broder's information supply vision matches the contextual search/personalized search vision. We need to infer the user's information need and understand the user's real intention so that we can return better search results.  Currently, the user can easily find satisfactory results on the Web for tasks such as finding the homepage of a person or a company. However, for many other search tasks, searchers cannot find a satisfactory answer. We need to do research on improving the user's search experience, or information seeking/acquisition experience.

Andrei Broder gives some general ideas about how information supply should work. However, he does not give concrete problems that we need to attack. Here are some problems I think we will face.

1) What kinds of information seeking activities can personalized search help? I do not think personalized search can help every search. For some search tasks, personalization can even degrade the search experience because of imprecise user modeling. Maybe personalization should target difficult information seeking activities.

2) How should the privacy issue be dealt with? Privacy is a big concern in personalized search because a lot of personal information will be disclosed and can potentially be abused. We need to study how different levels of privacy protection can match individual users' acceptable privacy levels, how the personalization software architecture should be chosen, and how we can implement personalization systems to guarantee the appropriate levels of privacy protection.

3) How should personalized search interact with the user? The user may not be willing to actively participate in the personalized search process. In such cases, we need to consider how to do personalized search in an implicit way. If the user is willing to contribute to personalized search, we need to think of a way to get the user involved. Moreover, how should we design the user interface so that the user understands how personalized search works, instead of assuming that the user simply accepts the black-box magic of personalized search? How should we design the personalized search interface to facilitate the personalization process?

Some other questions have been proposed in previous blog entries. 

 

 Vertical Personalized Search [tag: vertical search, healthcare, law, Healia]

November 21, 2006

I talked with a researcher about personalized information management in the healthcare domain. Contextual search is considered a promising way to improve information seeking for practitioners in a specific domain. It is interesting to see that vertical personalized search, or personalized search in a specific domain, has received a lot of attention.  For example, Healia is a startup providing a personalized health information retrieval service.

So far, I know of two domains that are interested in personalized search: law and healthcare. In both of these domains, people have to look for a needle in a haystack, and they care enough about finding relevant information to interact with the retrieval system for many iterations on a single information need, which gives the personalized search algorithm the opportunity to gather enough information about the user's intention.

However, I also have concerns about the feasibility of applying personalized search in these domains. For example, I met a researcher at a legal information system company who complained that lawyers did not want to try personalized search prototypes because of privacy concerns. So I also wonder what doctors think about personalized search. In order to apply personalized search in a specific domain, we may need to conduct a survey to find out whether people in that domain really like and accept the idea.

But I think vertical personalized search will become more popular in the future, and not only in the healthcare or law domains.

local.live.com expires in year 4001 [tag: privacy, cookie]

September 05, 2006

I checked the cookies on my web browser Firefox and found that one cookie of local.live.com has the following attributes.

Name: SerializationVersion
Content: 2
Host: local.live.com
Path: /
Send for: Any type of connection
Expires: Thursday, February 15, 4001 11:59:00 PM

Can we imagine what the world will be like in the year 4001?

I checked the cookies of many websites and found that expiration dates are commonly set far beyond the death of my laptop: year 2011 for mail.google.com, year 2016 for microsoft.com, year 2036 for amazon.com, year 2037 for yahoo.com....
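To put the local.live.com expiry in perspective, here is a small Python sketch that parses the date string exactly as Firefox displayed it and computes the remaining lifetime. The parsing format matches only this display string (not the raw cookie header), the function name is mine, and I take the date of this post as the observation date.

```python
from datetime import datetime

def years_until_expiry(expires: str, observed: datetime) -> float:
    """Approximate lifetime in years of a cookie, given Firefox's display string."""
    # Drop the leading weekday ("Thursday, ") and parse the rest.
    date_part = expires.split(", ", 1)[1]
    expiry = datetime.strptime(date_part, "%B %d, %Y %I:%M:%S %p")
    return (expiry - observed).days / 365.25  # rough year length

observed = datetime(2006, 9, 5)  # the date of this post
lifetime = years_until_expiry("Thursday, February 15, 4001 11:59:00 PM", observed)
print(f"The cookie outlives this post by about {lifetime:.0f} years")
```

That is a lifetime of nearly two thousand years for a cookie whose content is just a version number.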


Here is some information about the Internet cookie http://webmaster.info.aol.com/aboutcookies.html.

We need to think seriously about the privacy and security of Internet browsing behavior now. The same goes for Internet search activities. We need to care about user privacy in personalized search too.

 

How Was the AOL Searcher No. 4417749 Identified? [tag: privacy, search engine log]

August 24, 2006

There is a NY Times report from August 9, 2006 titled A Face Is Exposed for AOL Searcher No. 4417749. A lady in Georgia was identified, and a photo of her was put on the NY Times website too. Here is how her identity was discovered. Searcher No. 4417749 searched for "landscapers in Lilburn, Ga", “homes sold in shadow lake subdivision gwinnett county georgia”, "retirement communities for single women", and, multiple times, "eugene oregon jaylene arnold" or "jarrett t. arnold". An investigator, maybe a reporter, came to the town of Lilburn, GA and checked several people with the name Arnold. Thelma Arnold, a 62-year-old widow who lives in Lilburn, then said, “Those are my searches,” after the reporter read part of the list to her.

Privacy is a serious issue in personalized search research. Putting personalized search on the client side can alleviate the privacy concern.

Update: At Eric Selberg's blog (08/09/2006 entry), there is a link of DexOnline (online phone book), which lists 25 Arnolds in Lilburn, GA. He suspected Ms. Arnold was tracked down using high-tech means such as calling all the Arnolds in Lilburn, GA.

Partnership of Yahoo and EBay [tag: Web2.0, industry]

May 25, 2006

In industry, news of the partnership between Yahoo and eBay boosted the share prices of both companies. Many people think it is a win-win situation. I think so too. Both eBay and Yahoo really need some good news to boost investor confidence. Google is eating away at Yahoo's search share.  GBuy and Google Base are threats to eBay and PayPal.

From the technology perspective, Yahoo is now moving in the social media direction. For Yahoo, one potential advantage of partnering with eBay is eBay's huge user base. Moreover, many eBay users are very serious and loyal. Like MySpace and Facebook, Yahoo! can build a big social network based on the shoppers and businessmen of eBay.  Yahoo can provide personalized search and recommendation services to eBay users.

Personalization and Web 2.0 [tag: Web2.0]

May 22, 2006

Web 2.0 is hot. O'Reilly believes that one important feature of Web 2.0 is collective intelligence.  I consider collective intelligence to be the same thing as manpower or mass collaboration.

Does personalization belong to Web 2.0? In my opinion, it does not in the narrow sense, since personalization technology does not necessarily utilize collective intelligence. However, personalization is strongly related to recommendation systems, collaborative filtering, and social networks, which belong to Web 2.0. Thus it belongs to Web 2.0 in the broad sense.

PIM Workshop of SIGIR 2006 [tag: personal information management, UCAIR, SIGIR]

May 21, 2006

At SIGIR 2006, there is a two-day workshop on Personal Information Management (PIM). We submitted a paper about capturing and exploiting personal search history to improve retrieval accuracy. Here is the abstract of the submission.

Personal search history is an important type of personal information that is critical for learning a user's interests and information needs and can be exploited to improve the search service for a user. In this paper, we describe our recent work on User-Centered Adaptive Information Retrieval (UCAIR), which aims at capturing personal search history with a client-side search agent and exploiting the history information to help a user optimize search results.

We propose a decision theoretic framework and develop techniques for implicit user modeling based on a user's personal search history.  We propose several context-sensitive retrieval algorithms based on statistical language models to combine the personal search history with the current query for better ranking of documents. Using these techniques, we have developed an intelligent client-side web search agent, i.e., the UCAIR search agent, which can automatically capture a user's personal search history, store it in XML format on the local disk, and exploit it to provide personalized search.

 

Watson Commercialized [tag: Watson, industry]

December 12, 2005

Today, I read an article in the Chicago Tribune (free registration) about the software Watson. Watson has been commercialized after a quiet period. There are two academic papers about the Watson project, an IUI 2001 paper and a JASIS 1999 paper, both coauthored by Jay Budzik and Kristian Hammond.  It is interesting to see this academic project get commercialized.

I did not try to install Watson, although it is free. It looks much like Google Desktop Search, which I have already installed on my laptop. From the research point of view, I have not found any new feature in Watson from the demo on the website so far.

Back Button of Web Browser in Personalization [tag: UCAIR, web browser]

December 11, 2005

The UCAIR toolbar changes the semantics of the web browser's Back button. Using Internet Explorer with the UCAIR toolbar, when the user clicks a result on the search result page and then clicks the Back button, the user will see different content on the search result page. This is because the UCAIR personalized search agent updates the user model immediately after the user takes an action (clicking a result link) and reranks the search results according to the updated user model. So the user will see a reranked search result page, which is probably different from the page the user saw previously. Thus the semantics of the Back button changes after the installation of the UCAIR toolbar.

During several demos of the UCAIR toolbar, many people were interested in the changed semantics of the Back button. A lady said she would like to see the same content as before after clicking the Back button. Some people were interested in how to minimize the confusion this change causes the user, such as where the pushed-up results should be placed if the UCAIR toolbar has to change the semantics of the Back button.

I found that breaking the Back button was considered one of the top web design mistakes by Jakob Nielsen in 1999.  The semantics of the Back button is an open question for web design now, especially with dynamic web design techniques such as Ajax.   What does the user expect when he clicks the Back button? Probably the answer will not be consistent. There is some research on the Back button of the web browser, such as Getting Back to Back by Saul Greenberg and Andy Cockburn.

 

Personalization and Privacy [tag: privacy]

December 10, 2005

There is a book, Make It Personal, about personalization, privacy, and profit. Here is the book's Amazon link. The author is Bruce Kasanoff.  The book talks about how to do one-to-one marketing without invading privacy. There are some good reviews of this book at Amazon, especially the review by Peter Leerskov. It looks like a good e-commerce book. Personalization is still a buzzword in e-commerce. We can easily see that many websites claim to be personalized.

Personalized search has recently also become a very active research area in the ACM SIGIR community and the search engine industry. Privacy is a companion word of personalization, although industry seems to take this problem much more seriously than academia (I know the ACM SIGMOD community is doing a lot of research on database privacy).

There are some bills about privacy. EPIC (Electronic Privacy Information Center) is a good resource on online privacy, including its bill track, where you can find bills related to privacy passed by the 105th-109th Congresses.

 

A Discussion about Personalization [tag: Vivisimo, industry]

December 6, 2005

A long time ago, I mentioned the Vivisimo CEO's comments about personalization. I just found that Greg Linden has a post on his blog, with some interesting follow-up comments.

Again, I generally disagree with the "dead end" viewpoint. But we need to do solid work to demonstrate the advantages of personalization technology.

Implicit Feedback, Pseudo Feedback, Relevance Feedback and Active Feedback - UCAIR (14) [tag: implicit feedback, pseudo feedback and relevance feedback, active feedback]

October 21, 2005

Implicit feedback is a popular way to do personalized search. But a general audience may confuse it with pseudo feedback and relevance feedback. So it is worth clarifying the difference here.

Relevance feedback in information retrieval research was proposed in the 1970's by Gerard Salton and his co-workers as a way to improve retrieval accuracy. It works as follows. After the user submits a query, the retrieval system does a first run to rank documents and then presents a few top-ranked documents for the user to judge explicitly for relevance. After getting the user's relevance judgments on these documents, the retrieval system combines the judged documents with the original query through query expansion, does a second run, and presents the newly ranked documents to the user. Many empirical evaluations show that relevance feedback is an effective way to improve retrieval accuracy. The Rocchio feedback formula is the most popular way to do relevance feedback in the vector space model. Model-based feedback, proposed by ChengXiang Zhai in his CIKM 2001 paper, is a popular way to do relevance feedback with statistical language models.
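For readers who have not seen the Rocchio formula, here is a minimal Python sketch of it over sparse term-weight vectors: the new query is alpha times the old query, plus beta times the centroid of the judged relevant documents, minus gamma times the centroid of the judged non-relevant documents. The default weights are commonly quoted textbook values, not something from this post, and dropping negative weights is a common convention that I assume here.

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio feedback: move the query toward relevant documents
    and away from non-relevant ones. Vectors are {term: weight} dicts."""
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)  # centroid term
    # Assumed convention: terms with non-positive weight are dropped.
    return {t: w for t, w in new_query.items() if w > 0}

# Toy example with made-up vectors: the user judged two documents relevant
# (coffee-related) and one non-relevant (programming-related).
query = {"java": 1.0}
relevant = [{"java": 1.0, "coffee": 1.0}, {"coffee": 1.0, "espresso": 1.0}]
nonrelevant = [{"java": 1.0, "programming": 1.0}]
print(rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.5, gamma=0.25))
```

The expanded query gains "coffee" and "espresso" and does not pick up "programming", which is how the second run shifts toward what the user marked as relevant.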

However, in many retrieval tasks such as web search, the user is not willing to provide relevance feedback to the retrieval system. So pseudo feedback was later proposed. It works as follows. After the user submits a query, the retrieval system does a first run to rank documents and picks a few top-ranked documents. The system assumes these top-ranked documents are relevant and combines them with the original query through query expansion to do a second run. The retrieval system then presents the newly ranked documents to the user. Here we can clearly see that relevance feedback needs user involvement in the relevance judgment process while pseudo feedback does not. Many empirical evaluations show that pseudo feedback generally, but not always, outperforms the baseline retrieval. However, pseudo feedback is not as effective as relevance feedback.

Relevance feedback is not applicable in many search activities, while pseudo feedback totally excludes the user from the feedback process. So both relevance feedback and pseudo feedback have limitations. In interactive information retrieval such as web search, the user generally has many interactions with the retrieval system. During these interactions, the user gives a lot of hints to the retrieval system, which can help it better infer the user's information need. Thus implicit feedback was proposed. It works as follows. The retrieval system stores user interaction data such as query and clickthrough history, uses these data to better infer the user's information need, composes a new query to rank documents, and presents the ranked documents to the user. We can see that implicit feedback neither asks for the user's explicit relevance judgment nor categorically assumes that the top-ranked documents of the baseline retrieval are relevant. Instead, implicit feedback infers the user's information need from the hints implicitly provided by the user. However, there is a caveat. We need to analyze those hints carefully and avoid incorporating noise into the new query, which may even hurt retrieval performance. Read the paper Context-Sensitive Information Retrieval Using Implicit Feedback for more discussion and references.

To summarize the difference among these three feedback techniques: relevance feedback asks the user for explicit relevance judgments; pseudo feedback assumes the top-ranked documents of the baseline retrieval are relevant; implicit feedback tries to better infer the user's information need from the data implicitly provided by the user.
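
The difference is mainly in where the feedback documents come from. The following Python sketch makes that concrete. The function name and arguments are hypothetical, and treating every clicked result as a feedback document is a deliberate simplification: as noted above, real implicit feedback has to filter noisy hints.

```python
def feedback_documents(mode, ranked_ids, judged_relevant=(), clicked=(), k=3):
    """Pick the documents that will be used to expand the query.

    mode: "relevance", "pseudo", or "implicit" (labels chosen for this sketch).
    ranked_ids: document ids from the first retrieval run, best first.
    """
    if mode == "relevance":
        # The user explicitly marked these documents as relevant.
        judged = set(judged_relevant)
        return [d for d in ranked_ids if d in judged]
    if mode == "pseudo":
        # No user involvement: blindly assume the top k are relevant.
        return list(ranked_ids[:k])
    if mode == "implicit":
        # Use hints left by normal interaction, here simply the clicks.
        clicks = set(clicked)
        return [d for d in ranked_ids if d in clicks]
    raise ValueError(f"unknown feedback mode: {mode}")


ranked = ["d1", "d2", "d3", "d4", "d5"]
from_judgments = feedback_documents("relevance", ranked, judged_relevant={"d2", "d5"})
from_top_k = feedback_documents("pseudo", ranked, k=2)
from_clicks = feedback_documents("implicit", ranked, clicked=["d3"])
```

Whichever source is used, the selected documents can then be fed into the same query expansion step.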

Active feedback was proposed in the paper Active Feedback in Ad-hoc Information Retrieval. It can be considered a kind of relevance feedback. But traditional relevance feedback focuses on how to incorporate judged documents into the new query (e.g., query term addition and query term reweighting), while active feedback studies which documents should be presented to the user for relevance judgment in order to maximize what the retrieval system learns from the user's judgments. A general framework was proposed in the paper and several specific algorithms were derived from it.

 

Motivation for Personalized Search - UCAIR (13) [tag: difficult query, UCAIR]

October 20, 2005

In research papers or presentations, people often use ambiguous queries to motivate contextual or personalized search. Commonly used examples are "bass" (fish or instrument), "java" (programming language, island or coffee), "jaguar" (animal, car or Apple software) and "IR application" (Infrared application or Information Retrieval application).

These ambiguous queries are indeed one motivation for contextual search. However, the motivation for contextual search is not limited to query disambiguation. In my SIGIR 2005 paper, I showed that for 30 hard topics selected from TREC (Text REtrieval Conference) topics 1-150, the search needs to be put in context. These topics are called hard topics because previous experiments show that they have very poor retrieval performance using traditional retrieval algorithms. Looking through these hard topics, I can see that most of them are hard not because they are ambiguous. Instead, these topics are inherently hard because 1) it is very hard for the user to specify the information need clearly, since the description of these topics is very complex; 2) it is very hard for the retrieval system to find relevant documents, since there are very few relevant documents in the huge document collection. We demonstrated that using context information (query history and clickthrough data), we can improve retrieval performance. Here is an example of those hard topics. Each TREC topic is composed of a topic number (unique ID), title, description, and narrative.

<topic>
<number> 2
<title> Acquisitions
<desc> Document discusses a currently proposed acquisition involving a U.S.
company and a foreign company.
<narr> To be relevant, a document must discuss a currently proposed acquisition (which may or may not be identified by type, e.g., merger, buyout, leveraged buyout, hostile takeover, friendly acquisition). The suitor and target must be identified by name; the nationality of one of the companies must be identified as U.S. and the nationality of the other company must be identified as NOT U.S.
</topic>

 

For this topic, the description of the information need is very complex and there are a lot of constraints. Moreover, there are only 283 relevant documents in the whole document collection (this TREC collection has 242,918 documents). Here is a real query sequence (4 queries in a sequence) submitted by a single user and the corresponding poor retrieval performance. MAP means Mean Average Precision, which is a good (but not intuitive) measure of overall retrieval performance. Pr@20docs is the fraction of the top 20 documents that are relevant, which is a good measure of web search performance since many users only care about the relevance of the top-ranked results.

First query: acquisition u.s. foreign company
MAP: 0.004; Pr@20docs: 0.000

Second query: acquisition merge takeover u.s. foreign company
MAP: 0.026; Pr@20docs: 0.100

Third query: acquire merge foreign abroad international
MAP: 0.004; Pr@20docs: 0.050

Fourth query: acquire merge takeover foreign european japan
MAP: 0.027; Pr@20docs: 0.200
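
For readers who want to compute these two measures themselves, here is a minimal Python sketch using the standard definitions. The function names are my own, and the per-query figure labeled MAP above corresponds to what `average_precision` returns for a single ranked list. For instance, a Pr@20docs of 0.100 means 2 of the top 20 documents are relevant.

```python
def precision_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of the top k ranked documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero, so the
    sum is divided by the total number of relevant documents.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example: two of three relevant documents are retrieved, at ranks 1 and 3.
ap = average_precision(["a", "b", "c", "d"], {"a", "c", "z"})
p2 = precision_at_k(["a", "b", "c", "d"], {"a", "c", "z"}, k=2)
```

With only 283 relevant documents among 242,918, a ranking has to place relevant documents very near the top to score well on either measure, which is why the numbers above are so low.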

To summarize, query disambiguation is one motivation of contextual or personalized search. However, it is not the only motivation. For information seeking activities for hard topics, we also need to put the search in context.
 

Two Patents about Search Engine Personalization - Industry Series (10) [tag: patent, industry]

October 2, 2005

There are two patent applications related to search engine personalization.

One is from Google, Variable personalization of search results in a search engine, which was demonstrated somewhere on the Google website before, although the demo has since disappeared. The basic idea is to have a slider button for the user to tune the degree of personalization. Here is the abstract of the patent application.

This invention would enable a searcher to fill out a profile, perform a normal search, and then use a slider button to indicate how much his or her personal information from the profile should be used to modify (rerank) that search based upon the personalization information that they have entered into the profile, by sliding the button partially, or all the way to a full influence on the results.

The other is from Yahoo!, Color Graphing and Personalization. Here is the abstract of the patent application.

In a search processing system, identifying input authority weights for a plurality of pages, wherein an input authority weight represents a user's weight of a page in terms of interest; distributing a page's input authority weight over one or more pages that are linked in a graph to the page; and using a resulting authority weight for a page in effecting a search result list. The search result list might comprise one or more of reordering search hits and highlighting search hits.
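
To make the abstract a little more concrete, here is a toy Python sketch of spreading a user's page weights over linked pages and then reordering hits by the resulting weights. This is only my reading of the general idea. The function names, the single propagation step, and the `keep` fraction are all assumptions and are not taken from the patent application.

```python
def spread_authority(input_weights, links, keep=0.5):
    """Distribute each page's input weight over the pages it links to.

    input_weights: page -> user-assigned interest weight.
    links: page -> list of linked pages.
    keep: fraction of the weight the page retains (an assumed parameter).
    """
    result = {}
    for page, weight in input_weights.items():
        neighbors = links.get(page, [])
        if not neighbors:
            # Nothing to distribute to: the page keeps its full weight.
            result[page] = result.get(page, 0.0) + weight
            continue
        result[page] = result.get(page, 0.0) + keep * weight
        share = (1 - keep) * weight / len(neighbors)
        for neighbor in neighbors:
            result[neighbor] = result.get(neighbor, 0.0) + share
    return result


def reorder_hits(hits, authority):
    """Reorder search hits by resulting weight; ties keep the original order."""
    return sorted(hits, key=lambda page: -authority.get(page, 0.0))


weights = spread_authority({"a": 1.0}, {"a": ["b", "c"]})
reordered = reorder_hits(["c", "x", "a"], weights)
```

In this example a page the user never weighted ("c") still moves ahead of an unrelated hit ("x") because it is linked from a page the user cares about.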

Some Discussion about Thorsten's ACM SIGIR 2005 Paper - SIGIR Series (4) [tag: click bias, relative relevance]

October 1, 2005

Jakob Nielsen has an article about Thorsten's ACM SIGIR 2005 paper (see the September 3 post for more information about this paper), which has spurred some discussion at the Cre8site Forum. It is interesting to read the discussion about how to do user search behavior research in an unbiased way and about some of the research findings of this paper.

Vivisimo teams with MSN for FirstGov.gov - Industry Series (8) [tag: Vivisimo, industry]

September 30, 2005

Vivisimo has teamed with MSN to provide the search technology for the U.S. government's FirstGov.gov portal, as reported in a 09/26/2005 Search Engine Watch article. Compared with Google's well-exposed activities, which always attract media attention, even when the news is about the new business of Google's ex-chef (see Google to Noodles: A Chef Strikes Out on His Own from New York Times) or the hiring of some new chefs (see Wanted at Google: A few good chefs from News.com), coverage of this event has been relatively minimal.

Vivisimo has interesting technologies for clustering search engine results. Raul Valdes-Perez, CEO of Vivisimo, thinks that personalization is a dead end and has written an article about it, with which I generally disagree. The problems he mentions in the article have been addressed or are being addressed in personalization research.

A New Version of UCAIR Toolbar - UCAIR Series (11) [tag: UCAIR]

September 22, 2005

There is a new version of the UCAIR toolbar, which can be downloaded from the UCAIR project website. This version was rewritten by Bin nearly from scratch. We redesigned the software architecture of the UCAIR toolbar with the aim of making it extensible and robust.

A Seminar Course about Search Engines in SIMS, Berkeley - Academia Series (1) [tag: seminar, Marti Hearst]

September 21, 2005

There is a seminar course (Search Engines: Technology, Society, and Business) offered at SIMS, Berkeley in fall 2005. The course website says, "A set of top-notch experts have agreed to give lectures for fall 2005." Among them, Dr. Susan Dumais from Microsoft Research and Dr. Sepandar Kamvar (co-founder of Kaltix) from Google will give lectures. Both of them work on personalized search, so their talks will probably be related to it. The slides and videos for some talks are available at the website.

Personalized Search Papers at ACM CIKM 2005 - CIKM 2005 Series (1) [tag: CIKM, UCAIR, Y!Q]

September 20, 2005

CIKM 2005, one of the top information retrieval research conferences, will be held in Bremen, Germany from October 31st to November 5th. The last session of this conference is about context and personalization. There will be three paper presentations in this session:

Context Modeling and Discovery Using Vector Space Bases by Massimo Melucci (University of Padua)

Y!Q: Contextual Search at the Point of Inspiration by Reiner Kraft, Farzin Maghoul, Chi Chao Chang (Yahoo! Inc.)

Implicit User Modeling for Personalized Search by Xuehua Shen, Bin Tan, Chengxiang Zhai (CS, UIUC)


For the Y!Q paper, the blog of the first author, Reiner Kraft, explains the new feature of Y!Q. When you read a web page and are interested in some phrases or a sentence, you can mark them and trigger a search. Actually, this functionality already appeared in the now-defunct IntelliZap system (see the WWW 2001 paper).

Search Engine Web APIs - Industry Series (7) [tag: Web API, industry]

September 19, 2005

The Google Web API provides a way for programmers to develop interesting search-related applications that utilize the power of the Google search engine. But currently there are some limitations that keep programmers from developing large-scale applications. I notice at least two. One is that an account can submit at most 1000 requests per day, and the other is that each query returns at most 10 search results. With these two limitations, client-side programs cannot frequently get many results from Google through the Google Web API and thus cannot do much interesting processing, such as result reranking, at a large scale.

The Yahoo Web API permits 5000 queries per IP per day and 50 search results per query, so it is friendlier to developers. Meanwhile, MSN is also preparing to release its Web APIs (see news from News.com). I hope the competition will push all search engines to upgrade their Web APIs in the near future, which will benefit developers and eventually end users.
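
A quick back-of-the-envelope calculation in Python shows how far apart these limits are. The function names are my own. The second function additionally assumes that a client may spend several requests paging through the results of one query, which is an assumption for illustration and not something stated above. Note also that the Google limit is per account while the Yahoo limit is per IP.

```python
import math


def max_results_per_day(requests_per_day, results_per_request):
    """Upper bound on search results one client can fetch per day."""
    return requests_per_day * results_per_request


def queries_per_day(requests_per_day, results_per_request, results_wanted):
    """Number of distinct queries per day if each query needs results_wanted
    results and paging through results costs one request per page (assumed)."""
    requests_per_query = math.ceil(results_wanted / results_per_request)
    return requests_per_day // requests_per_query


google_results = max_results_per_day(1000, 10)   # 10,000 results per day
yahoo_results = max_results_per_day(5000, 50)    # 250,000 results per day

# Reranking the top 100 results of each query, under the paging assumption:
google_queries = queries_per_day(1000, 10, 100)  # 100 queries per day
yahoo_queries = queries_per_day(5000, 50, 100)   # 2,500 queries per day
```

Under these assumptions the Yahoo limits allow about 25 times as many results per day as the Google limits.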

Notions of Personalization in Industry- UCAIR Series (10) [tag: notion]

September 18, 2005 (China Mid-Autumn Festival)

Besides personalized search engines, industry also offers personalized portals and recommendation systems, which are briefly discussed below.

Personalized Portal: My Yahoo is the pioneer of the personalized web portal, which includes personalized news, weather forecasts, comics, and TV listings. The user can customize the portal by choosing content of interest, color, layout, etc. Findory is a web site that provides a personalized news service. Unlike My Yahoo, it does not require the user to explicitly specify interests. Instead, the web site implicitly infers the user's interests from the user's interaction history on the site. The more browsing history is collected, the better the selection of personalized news articles.

Recommendation System: Many e-commerce web sites try to build a personalized store for each online customer. Amazon is the most famous of them. It uses collaborative filtering techniques to recommend items to customers according to the products they have purchased or viewed before.

Notions of Personalization in Personalized Search Engine- UCAIR Series (9) [tag: notion]

September 17, 2005

Web search engines have achieved great success in helping people find information on the Web, especially for simple information needs such as homepage finding. However, search engines still perform poorly in many other tasks. There are many reasons for this poor performance. Among them, two important reasons are frequently pointed out. First, many user queries are ambiguous, or the user does not know how to specify the information need exactly. Thus the search engine cannot infer the user's real information need from the current query alone. Second, information retrieval is an interactive process; users adjust their queries during this process. Therefore, the search engine should also adjust its inference of the user's information need. Nevertheless, most, if not all, search engines currently use only the user's current query to do the search. Some search engine companies such as Google, MSN Search and Yahoo are trying to use contextual and personal information to help the search. Some, such as Google, have already released a test version of personalized search. Yahoo co-founder Jerry Yang said that the relevance of search is still the Holy Grail for any search application and that the key challenge for Yahoo and all search companies going forward will be to find ways to increase the personalization of results, i.e., making sure that a user truly finds what he or she is looking for when typing in a keyword search.

Notions of Personalization in Human-Computer Interaction Community- UCAIR Series (8) [tag: HCI, interface]

September 16, 2005

Currently, there is much interest in the personalization of product interfaces. For example, mobile phones are now sold with replaceable colored covers, e-commerce sites learn a user's preferences, and word processors allow you to customize the menus and tool bars. In an HCI 2000 poster, personalization is defined as follows.

            
  Personalization is defined here as a process that changes the functionality, interface, information content, or distinctiveness of a system to increase its personal relevance to an individual.



The motivations for personalization are divided into those that primarily facilitate work, e.g., bookmarking a web page, and those that primarily accommodate social requirements, e.g., expressing the identity of the user. The HCI community focuses on how to model user search behavior, what kinds of user actions (such as mouse movement) are related to user interests, and how the system can extract useful information during the user's interaction with it to do personalization. In an IUI 2004 paper, the author studied the correlation between four mouse operations and user interests and used these mouse operations as clues to extract context keywords for similarity search.

Notions of Personalization in Information Retrieval Community- UCAIR Series (6) [tag: information retrieval, contextual search]

September 14, 2005

Research in information retrieval has a long history dating back to the 1950s. Over the decades, significant progress has been made in developing retrieval models such as the vector space model, the probabilistic model and, recently, the statistical language model, in performing large-scale empirical evaluation, and in building useful systems such as SMART, Lemur and Google. Nevertheless, almost all existing retrieval models and systems can be characterized as "one size fits all". Only user queries are used to represent the user's information need, and there is no representation of search context or user preference. Thus the same query submitted by different users is treated in exactly the same way. A great deal of the responsibility for finding relevant information is left to the user. However, the ideal retrieval system should proactively incorporate both the user's search context and personal preference into the retrieval decision process. In a recent workshop about challenges in information retrieval and language modeling, personalization and contextual search was considered one of two big challenges in information retrieval. They define contextual search as follows.


                   Contextual Search: Combine search technologies and knowledge about query and search context into a single framework in order to provide the most "appropriate" answer for a user's information needs.

However, despite recent attention to this problem, little progress has been made due to the difficulty of capturing and representing knowledge about the user, context and task in the general web search environment. Although there are many studies of retrieval models (by computer science researchers) and of user models and the user's information seeking process (by information science researchers), research on user models and research on retrieval models are currently not well integrated.

Participants of the workshop believe that the future search engine should be able to collect use context and query features to infer characteristics of the information need unobtrusively. A retrieval framework integrating retrieval model and user model needs to be proposed, studied and evaluated empirically.

Web Browser, Search Engine and Toolbar - Industry Series (6) [tag: web browser, industry]

September 12, 2005 

The web browser is the most important window onto the immense information on the Internet. There are Internet Explorer (IE) (IE 7.0 is in beta testing), Firefox (more than 80 million downloads since its release on November 9, 2004), Opera (it just celebrated its 10th anniversary on August 30, 2005), Netscape (watch the drama of the browser war between Netscape/Firefox and IE), Safari, and others. Developers can add new functionality to the web browser through add-ins such as the Google toolbar.

The search engine helps the user find information on the Internet. Google, Yahoo and MSN are the dominant players in the search engine arena (watch the drama of Microsoft's suit against Google and Kai-Fu Lee). All of them offer IE toolbars, which help the user search without visiting the search engine homepage, and APIs, which help developers add new functionality based on those search engines.

The UCAIR toolbar is an IE toolbar that uses Google search results as its basic results. So far it does not make use of the Google APIs, although that is an option under consideration.

Information Sources of Search Engine Industry- Industry Series (5) [tag: search engine watch, industry]

September 11, 2005 

Search Engine Watch is a very popular electronic daily that provides information about search technology. Their staff also organizes the Search Engine Marketing conference several times each year around the world. This is a primary place to read news about search technology in industry. In each daily issue, they provide a lot of interesting links to news about the search engine industry. Search Engine Journal, WebMaster World, Search Engine Show Down, and Cre8site are also useful electronic journals or forums for news about search technology.

Some interesting blogs are as follows, Geeking with Greg (with more technology flavor), Google Blog (official Google blog), Yahoo Blog (official Yahoo Search blog?) (official MSN Search blog?).

Among these information bushes, we can pick the personalized search berry, which is a hot area for most, if not all, search engine companies.

Desktop Search Software and APIs - Industry Series (4) [tag: desktop search]

September 10, 2005

Desktop Search software is a search engine for the personal computer. Products are available from Google, MSN, Yahoo (or X1), and Copernic (Copernic Desktop Search), all of which have free versions. In order to let developers add functionality to the fledgling Desktop Search software, Google and MSN provide APIs. The Google API documentation looks comprehensive, and there is a nice developer discussion group.

Using Desktop Search APIs, information retrieval researchers have more power to do research on the personalized search on the client side. They can have access to the index structure of user local files and build a better user model.

UCAIR Personalized Search Toolbar - UCAIR Series (4) [tag: UCAIR]

September 9, 2005

In the UCAIR project, we are developing the UCAIR Personalized Search Toolbar. The software can be downloaded from the UCAIR project website. Currently, the UCAIR Toolbar is an Internet Explorer plug-in and uses Google search results as its basic results. But it is a matter of engineering to integrate it with other web browsers such as Firefox or to use results from other search engines such as Yahoo.

Compared with personalization on the search engine server side, personalization on the client side, as the UCAIR toolbar does it, has the following advantages. 1) Privacy is much less of a concern. The user interaction history is kept strictly on the client side, and the search engine cannot store information about what you have viewed. 2) On the client side, there is much richer user information than just keyword queries and clickthrough data, which can be used to better infer the user model. For example, the user's local files can be indexed to represent the user's interests. 3) The computation and storage cost is reduced on the search engine side. The disadvantage I can see for personalization on the client side is that there is no global index of all web pages, so the client side probably cannot control the general retrieval function.

comScore has a report about search engine ratings in July 2005. Not surprisingly, Google maintains the lead with a 36.5% share of searches, followed by Yahoo (30.5%) and MSN (15.5%). But for searches submitted from toolbars, Yahoo tops the share. "Yahoo toolbars processed more than 282 million searches during the month, a 74-percent increase over the previous year". A more interesting number related to personalization on the client side is the ratio of searches submitted from toolbars to all searches. "In July, 11 percent of all domestic searches were conducted via toolbars, up from 8 percent in July 2004." From this number, we can see that a lot of searches (11%) are indeed submitted from toolbars. If personalization functionality is added to the toolbar, this percentage is expected to increase, since users will see more relevant web pages returned when they search with a personalized search toolbar [See our CIKM 2005 paper for a user study about personalized web search].

Findory Personalized News - Industry Series (2) [tag: Findory, Personalized News]

September 8, 2005

I first learned about the Findory website around a year ago. I was excited to see that Findory provides a personalized news service. The idea behind Findory is similar to that of Amazon, i.e., collaborative filtering in recommendation systems. Actually, the founder of Findory (Greg Linden) previously worked in the Amazon Personalization Group and wrote the paper Amazon.com Recommendations: Item-to-Item Collaborative Filtering in IEEE Internet Computing, January 2003.

The collaborative filtering idea is easily applied on the server side if privacy is not a big concern. Amazon has achieved great success using the same idea. Moreover, the more user interaction data (e.g., keyword queries and viewed web pages) are collected, the better the personalization is supposed to be. However, for personalization on the client side, which greatly reduces the privacy concern, it is hard to get the interaction history of other people, and thus collaborative filtering cannot be used in personalized search on the client side. On the other hand, personalization techniques on the client side can exploit much richer user information, such as the user's desktop index, to infer a better user model and thus improve retrieval accuracy.
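
To see why collaborative filtering depends on other people's histories, consider this minimal Python sketch of an item-to-item approach based on cosine similarity over co-occurring users. It is a simplified illustration with hypothetical function names and toy data, not the algorithm used by Amazon or Findory.

```python
import math
from collections import defaultdict
from itertools import combinations


def item_similarities(histories):
    """Cosine similarity between items, based on which users touched them.

    histories: user -> set of items the user viewed or purchased.
    """
    item_users = defaultdict(set)
    for user, items in histories.items():
        for item in items:
            item_users[item].add(user)
    sims = {}
    for a, b in combinations(sorted(item_users), 2):
        common = len(item_users[a] & item_users[b])
        if common:
            sim = common / math.sqrt(len(item_users[a]) * len(item_users[b]))
            sims[(a, b)] = sims[(b, a)] = sim
    return sims


def recommend(user, histories, sims, n=3):
    """Rank unseen items by their summed similarity to the user's items."""
    seen = histories[user]
    scores = defaultdict(float)
    for (a, b), sim in sims.items():
        if a in seen and b not in seen:
            scores[b] += sim
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]


histories = {"u1": {"x", "y"}, "u2": {"x", "y", "z"}, "u3": {"x"}}
sims = item_similarities(histories)
suggestions = recommend("u3", histories, sims)
```

The similarity table is built entirely from the pooled histories of all users, which is exactly the data a purely client-side tool does not have.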

Compared with Findory, Google News offers customization, i.e., the user can choose keyword query terms he likes, e.g., "Personalized Search", and then the matching news articles are put in a new category. Moreover, the user can change the layout of the different Google News categories. I am not sure when Google Personalized Search, launched in late June, will be integrated with Google News.

IRiX Workshop at ACM SIGIR 2005 - SIGIR Series (3) [tag: IRiX, SIGIR 2005]

September 6, 2005

The second IRiX  (Information Retrieval in Context) Workshop at ACM SIGIR Conference was held on August 19, 2005. At SIGIR 2004 in Sheffield, UK, the first IRiX workshop was organized by Peter Ingwersen, Keith van Rijsbergen, and Nick Belkin. The proceedings and workshop report of the first IRiX workshop are available at SIGIR Forum. The second IRiX workshop was organized by Peter Ingwersen, Kalervo Järvelin, and Nick Belkin. The program and proceedings are also available.

At the workshop, besides several presentations, there were very interesting discussion groups. I joined Group A, Context, Situation, and Task: Implications for IR, led by Nick Belkin. We discussed what context and situation are and how they relate. After the discussion, we considered a situation to be a snapshot of the values of the context variables. Context has user, system and environment facets. We tried to enumerate the lower-level facets. See the Group Notes for the discussion results.

Interestingly, there is another IRiX workshop held in Glasgow in July 2005, which was organized by Joemon Jose and Keith van Rijsbergen.  I think the next important steps for IRiX research in academia include preparing an evaluation data set (there is a test data set available for contextual search from UCAIR project, but far from enough), proposing the retrieval model to incorporate the context information, and making a real system to demonstrate the power of search in context. 

Jaime Teevan's Personalized Search Paper - SIGIR Series (2) [tag: Jaime Teevan, Susan Dumais, SIGIR 2005]

September 5, 2005

At ACM SIGIR 2005, Jaime Teevan presented her work with Susan Dumais and Eric Horvitz, Personalizing Search via Automated Analysis of Interests and Activities. This is very interesting and very solid work on personalized search. In their problem setting, they do personalized search on the client side and use result reranking to do personalization. Their reranking formula is borrowed from relevance feedback. However, the feedback documents are provided implicitly rather than explicitly by the user. They study how to obtain the statistics used in the formula, such as tf and idf. There are several variations of collection/corpus representation, user representation, and document and query representation. Two interesting findings are that the performance of personalized results based purely on user profiles is actually worse than that of the original web search results, and that a mixture of the personalized ranking and the original web ranking is needed to achieve better results.
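
The second finding suggests blending the two rankings rather than replacing one with the other. Here is a minimal Python sketch of one way to do that with a linear mixture. The function name, the rank-to-score conversion, and the `weight` parameter are my own assumptions and are not the formula used in the paper. Personal scores are assumed to be non-negative.

```python
def blend_rankings(web_ranked, personal_scores, weight=0.5):
    """Rerank web results by mixing original rank with a personal score.

    web_ranked: document ids in the original web order, best first.
    personal_scores: document id -> non-negative profile-based score.
    weight: 0 keeps the web order, 1 uses only the personal scores.
    """
    n = len(web_ranked)
    # Convert the original rank position into a score in (0, 1].
    web_scores = {d: (n - i) / n for i, d in enumerate(web_ranked)}
    # Normalize personal scores by their maximum (fall back to 1 if all zero).
    max_personal = max(personal_scores.values(), default=0.0) or 1.0

    def score(doc_id):
        personal = personal_scores.get(doc_id, 0.0) / max_personal
        return (1 - weight) * web_scores[doc_id] + weight * personal

    # sorted() is stable, so ties keep the original web order.
    return sorted(web_ranked, key=score, reverse=True)


reranked = blend_rankings(["a", "b", "c"], {"c": 2.0}, weight=0.5)
```

With `weight=0.5` the profile-matching document "c" moves to the top, while the remaining documents keep their relative web order.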

This work can be considered as using long-term context to improve retrieval accuracy. It is not clear how the work selects the recent index. In my opinion, the context information within the same information seeking session is short-term context, which may be more important than long-term context for improving the search results. One problem with short-term context is that it is generally very sparse. However, the long-term context can be used for smoothing. I think one interesting direction for future work is to explore how to differentiate the user profiles stored on the local disk, instead of treating them equally, and to use them at a finer granularity.
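
As a rough illustration of what smoothing could mean here, the following Python sketch interpolates a sparse short-term unigram model with a denser long-term one. The function names, the whitespace tokenization, and the interpolation weight `lam` are assumptions made for this sketch, not a method from any of the papers mentioned.

```python
from collections import Counter


def unigram_model(texts):
    """Maximum-likelihood unigram model over whitespace-separated words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def smoothed_context_model(short_term, long_term, lam=0.7):
    """Interpolate the session model with the long-term profile model.

    short_term: texts from the current session (e.g., recent queries).
    long_term: texts from the stored user profile.
    lam: weight on the short-term model (an assumed parameter).
    """
    p_short = unigram_model(short_term)
    p_long = unigram_model(long_term)
    vocabulary = set(p_short) | set(p_long)
    return {
        w: lam * p_short.get(w, 0.0) + (1 - lam) * p_long.get(w, 0.0)
        for w in vocabulary
    }


model = smoothed_context_model(
    ["java island"], ["java coffee", "java programming"], lam=0.5
)
```

Words that never occurred in the short session, such as "coffee", still receive some probability from the long-term profile, which is the point of the smoothing.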

Google Personalized Search - Industry Series (1) [tag: Google Personalized Search, industry]

September 4, 2005

Google Personalized Search was launched on June 28, 2005, as reported in the Search Engine Watch Blog. Before that, there was another Google Personalized Search, which asked each user to create a profile explicitly (e.g., which categories of information, such as kids and sports, you prefer) and then personalized the search results according to the selected categories. The new Google Personalized Search instead uses the user's interaction history as implicit feedback to infer the user's interests and personalize the search results. They store the user's query history and clickthrough data on Google servers. When the user submits a query, the query history and clickthrough data are used to personalize the search results. No details are provided about how they utilize the interaction history in personalization. Presumably, the more interaction history is collected, the more relevant the personalized search results become.

Charlene Li has a blog post discussing the launch of Google Personalized Search. Several issues mentioned in the post, such as privacy and possible improvements, are actually addressed in the UCAIR project (see the next paragraph). Geeking with Greg has several thoughts about Google Personalized Search too. One issue pointed out in that blog is that Google Personalized Search does not tell the user which results are pushed up by personalization (the UCAIR project does) or why they are pushed up.

Compared with Google Personalized Search, UCAIR project personalizes the search result on the client side. The user interaction history is totally stored on the client side so that the privacy is not an issue as it is for Google Personalized Search. Moreover, putting personalized search on the client side, we can utilize more user information on the client side to do further personalization. For example, we can make use of the local files on the user's hard disk and bookmarks of web browser as long-term user interest to personalize the search results. It is expected that some search engine toolbars will introduce such functionality in the future.

Thorsten Joachims' Implicit Feedback Paper - SIGIR Series (1) [tag: Thorsten Joachims, clickthrough data]

September 3, 2005

At SIGIR 2005, Professor Thorsten Joachims and his coworkers presented a very good paper, Accurately Interpreting Clickthrough Data as Implicit Feedback. In this work, they study how reliable user clickthrough data is as a source of implicit feedback. They conduct two related user studies. One uses an eye tracker to record users' eye movements and infer their browsing behavior (viewing, clicking, and the relationship between the two). The other asks the participants to explicitly judge the relevance of search results, so that the correlation between implicit feedback and explicit relevance feedback can be studied. The main finding of this work is that clicks are more accurate when interpreted as relative relevance judgments (e.g., one search result is more relevant than another) than when interpreted as absolute relevance judgments.
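To make the idea of relative relevance concrete, here is a minimal Python sketch of one possible way to turn clicks into pairwise preferences: a clicked result is taken to be preferred over every unclicked result ranked above it. The function name, the document ids, and the data layout are my own assumptions for illustration; this is not the code or the exact procedure from the paper.

```python
from typing import List, Set, Tuple


def preferences_from_clicks(ranking: List[str], clicked: Set[str]) -> List[Tuple[str, str]]:
    """Return (preferred, less_preferred) pairs from one result list.

    Illustrative heuristic (an assumption, not the paper's code): each
    clicked result is preferred over every unclicked result that was
    ranked above it. No absolute relevance label is assigned.
    """
    pairs = []
    for pos, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        # Only results shown above the click can count as "skipped".
        for skipped in ranking[:pos]:
            if skipped not in clicked:
                pairs.append((doc, skipped))
    return pairs


if __name__ == "__main__":
    # Hypothetical result list and click log for a single query.
    ranking = ["d1", "d2", "d3", "d4"]
    clicked = {"d3"}
    print(preferences_from_clicks(ranking, clicked))
```

Note that the sketch says nothing about d4: the user may never have looked that far down the list, so no preference is inferred for it.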

This paper mainly conducts user studies to gain insight into clickthrough data as implicit feedback. They have a related paper, Optimizing Search Engines Using Clickthrough Data, in SIGKDD 2002, which is about using an SVM to train a retrieval ranking function from clickthrough data. The SIGKDD 2002 paper already used the idea of relative relevance instead of absolute relevance.
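As a rough illustration of learning a ranking function from relative judgments, the sketch below fits a linear scoring function so that the preferred document of each pair scores higher than the other one. It uses plain subgradient steps on a pairwise hinge loss as a simplified stand-in, not an actual SVM solver, and the feature vectors, function names, and parameter values are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(w: Vector, x: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(pairs: List[Tuple[Vector, Vector]], epochs: int = 50,
                   lr: float = 0.1, reg: float = 0.01) -> List[float]:
    """Learn weights w so that dot(w, better) exceeds dot(w, worse).

    Each pair is (features of preferred doc, features of less preferred doc).
    Simplified stand-in for a ranking SVM: subgradient descent on a
    pairwise hinge loss with L2 shrinkage.
    """
    w = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            # L2 regularization: shrink the weights a little each step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            # Update only when the pair is not yet separated by margin 1.
            if dot(w, diff) < 1.0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


if __name__ == "__main__":
    # Hypothetical two-dimensional features for (preferred, other) documents.
    pairs = [([0.9, 0.1], [0.2, 0.8]),
             ([0.7, 0.3], [0.4, 0.5]),
             ([0.8, 0.2], [0.3, 0.1])]
    w = train_pairwise(pairs)
    for better, worse in pairs:
        print(dot(w, better) > dot(w, worse))
```

The point of the pairwise setup is that the training data never needs an absolute relevance label; it only needs to know which of two documents the user preferred.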

A follow-up to the SIGIR 2005 paper is Query Chains: Learning to Rank from Implicit Feedback in SIGKDD 2005, in which they use clickthrough data to train the retrieval function of a search engine. The SIGKDD 2005 paper can be considered a combination of the SIGKDD 2002 and SIGIR 2005 papers.

This work can also be considered a finer-grained study of how to make use of clickthrough data in contextual search. Our paper Context-Sensitive Information Retrieval Using Implicit Feedback in SIGIR 2005 focuses on how to incorporate clickthrough data into models for contextual search.

SIGIR 2005 Papers from UCAIR Project - UCAIR Series (2) [tag: UCAIR, active feedback, implicit feedback]

September 2, 2005

There are two papers from the UCAIR project published in SIGIR 2005. The first is Active Feedback in Ad-hoc Information Retrieval by Xuehua Shen and ChengXiang Zhai. This work studies how to select documents to present for relevance judgment when the user is willing to judge some documents, whereas traditional relevance feedback focuses on query term expansion and query term reweighting given documents the user has already judged. We propose a preliminary framework and several active feedback methods in this paper. The second is Context-Sensitive Information Retrieval Using Implicit Feedback by Xuehua Shen, Bin Tan and ChengXiang Zhai. This work studies how to model user interaction history (implicit feedback) to improve retrieval accuracy. We propose four statistical contextual language models to incorporate context information.
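One simple way to fold context into a query language model is to interpolate the unigram model of the current query with a model estimated from the earlier interaction history. The Python sketch below illustrates that general idea with a fixed weight. The function names, the weight alpha, and the example queries are assumptions made for illustration, and the sketch should not be read as any one of the four models in the paper.

```python
from collections import Counter
from typing import Dict, List


def unigram_model(texts: List[str]) -> Dict[str, float]:
    """Maximum-likelihood unigram model over whitespace-separated words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def interpolate(query_model: Dict[str, float], history_model: Dict[str, float],
                alpha: float = 0.5) -> Dict[str, float]:
    """Mix the two models: alpha * p(w | query) + (1 - alpha) * p(w | history)."""
    vocab = set(query_model) | set(history_model)
    return {w: alpha * query_model.get(w, 0.0)
               + (1.0 - alpha) * history_model.get(w, 0.0)
            for w in vocab}


if __name__ == "__main__":
    # Hypothetical session: an ambiguous query preceded by two earlier queries.
    current_query = ["java"]
    history = ["sun java jdk download", "java programming tutorial"]
    model = interpolate(unigram_model(current_query), unigram_model(history))
    for word, prob in sorted(model.items(), key=lambda item: -item[1]):
        print(f"{word}\t{prob:.3f}")
```

In this toy session, words such as "programming" and "jdk" receive non-zero probability in the context-sensitive model even though the current query is only "java", which is the kind of evidence a retrieval function could use to favor the programming-language sense.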

CIKM 2005 Paper from UCAIR Project - UCAIR Series (1) [tag: CIKM 2005, motivation of personalization]

September 1, 2005

A common major limitation of existing retrieval models and systems is that the retrieval decision is, in general, based solely on the query and document collection; information about the actual user and the search context is largely ignored. This limitation makes the retrieval performance of existing IR systems inherently non-optimal, as seen clearly in the following two cases:

  • Different users may use exactly the same query to search for different information, but existing IR systems return the same results for these users. For example, the query "IR applications" on Google returns a mixture of documents about "information retrieval" applications and "infrared" applications, as "IR" can be an acronym for both information retrieval and infrared. Without considering the actual user, it is inherently impossible to know which sense "IR" refers to.
  • A user's information needs may change over time. The same user may sometimes use "java" to mean the island of Java and at other times use "java" to mean the programming language. Without recognizing the search context, it would again be inherently impossible to recognize the correct sense.
It is therefore clear that an optimal retrieval system must incorporate both user information and search context into the retrieval decision process.

The UCAIR (pronounced "you care"; it stands for User-Centered Adaptive Information Retrieval) project seeks to break this limitation of existing retrieval methods and formally develop a new retrieval paradigm called user-centered adaptive information retrieval (UCAIR), in which user information and search context are both exploited to improve retrieval performance.

Here is a paper that will be published at CIKM 2005: Implicit User Modeling for Personalized Search by Xuehua Shen, Bin Tan, and ChengXiang Zhai. The paper makes two main contributions. One is to propose a decision-theoretic framework and develop techniques for implicit user modeling in information retrieval. The other is to develop and evaluate a client-side personalized search agent UCAIR