Personalized Search
It can also be viewed at ucair.wordpress.com. Comments or suggestions?
Please send them to
Cryptography and Privacy Preservation in Personalization [tag: zero-knowledge proof, privacy]
March 18, 2007
Avi Wigderson gave three lectures in a Princeton public lecture
series. The three talks covered computation/computability,
computational complexity, and cryptography. In the lecture on
cryptography, he talked about zero-knowledge proofs, private
communication, and oblivious communication.
I hope that these techniques can be applied to privacy-preserving
personalized search. In the most ambitious scenario of
privacy-preserving personalized search in my SIGIR Forum paper
(Level IV: no personal information), the search engine returns
relevant results after the user submits a query, yet it never
learns which query terms the user submitted.
P.S. Google changed its privacy policy for search engine logs last
week. Google will remove the last 8 bits of the 32-bit IP address
associated with each query after the logs have been stored for 18 to
24 months.
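To make the "last 8 bits" concrete, here is a minimal sketch in Python. It assumes that "remove" means zeroing the low-order bits, which is only my reading of the policy; the function name `anonymize_ipv4` is hypothetical and this is not Google's code.

```python
import ipaddress

def anonymize_ipv4(address: str, bits_removed: int = 8) -> str:
    """Zero out the lowest `bits_removed` bits of an IPv4 address.

    Assumption: "removing" bits is modeled as setting them to zero.
    """
    ip = int(ipaddress.IPv4Address(address))
    # Build a 32-bit mask whose low `bits_removed` bits are 0.
    mask = (0xFFFFFFFF >> bits_removed) << bits_removed
    return str(ipaddress.IPv4Address(ip & mask))

# 192.0.2.131 is a documentation-only example address.
print(anonymize_ipv4("192.0.2.131"))
```

After masking, the 256 addresses that share the first 24 bits become indistinguishable in the log, so the query can only be tied to a block of addresses rather than to a single one.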
How Much a Search Engine Company Can Make for Each Search
March 17, 2007
Recently, Yahoo! began to use its new ad system, Panama, hoping to
narrow the gap in money-making power between Google and Yahoo!. From
a BusinessWeek article of December 26, 2006, I learned that Tim Boyd,
a financial analyst at Caris & Co., estimated that Google makes 20
cents per search while Yahoo! makes 10 cents per search. During a
visit, I mentioned this number to a friend. My friend said he had
seen a different number and that numbers from financial analysts
should sometimes be double-checked. I agree with my friend. Moreover,
Jon Bentley also suggested that we should use "back-of-the-envelope"
calculations, standard fare in engineering schools. Here is my
"back-of-the-envelope" calculation of Google's money-making power.
In Q3 2006, the total revenue of Google was $2.690 billion according
to Google's income statement. According to Nielsen//NetRatings data,
Google received 2.776 billion queries (49% U.S. search share) in July
2006, 3.003 billion queries (50%) in August 2006, and 2.826 billion
queries (50%) in September 2006. Thus in Q3 2006, 8.605 billion
queries were submitted to Google. If we assume that all of Google's
revenue comes from ads (AdWords or AdSense), then on average Google
made $2.690 billion / 8.605 billion queries = $0.31/query, i.e., 31
cents per query.
In Q4 2006, the total revenue of Google was $3.205 billion. According
to Nielsen//NetRatings data, Google received 3.022 billion queries
(50%) in October 2006, 3.098 billion queries (50%) in November 2006,
and 3.036 billion queries (51%) in December 2006. Thus in Q4 2006,
9.156 billion queries were submitted to Google. On average Google
made $3.205 billion / 9.156 billion queries = $0.35/query, i.e., 35
cents per query.
From this simple calculation for Q3 2006 and Q4 2006, we can see
that Google indeed makes somewhat more than 30 cents per query on
average. Since Yahoo!'s revenue comes from diverse sources, it is
difficult to compute the Yahoo! number from the Nielsen//NetRatings
numbers and the financial reports.
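The same back-of-the-envelope arithmetic can be written as a few lines of Python, dividing each quarter's revenue by the sum of its three monthly query counts. The figures are the ones quoted in this post; the variable and function names are only illustrative, and the simplifying assumption that all revenue comes from search ads is kept.

```python
from typing import Sequence

# Quarterly revenue (billions of USD) and monthly query counts (billions),
# copied from the figures quoted in the post.
QUARTERS = {
    "Q3 2006": {"revenue": 2.690, "queries": [2.776, 3.003, 2.826]},
    "Q4 2006": {"revenue": 3.205, "queries": [3.022, 3.098, 3.036]},
}

def revenue_per_query(revenue: float, monthly_queries: Sequence[float]) -> float:
    """Average revenue per query, assuming all revenue comes from search ads."""
    return revenue / sum(monthly_queries)

for name, quarter in QUARTERS.items():
    total_queries = sum(quarter["queries"])
    per_query = revenue_per_query(quarter["revenue"], quarter["queries"])
    print(f"{name}: {total_queries:.3f} billion queries, ${per_query:.2f} per query")
```

Note that the revenue is Google's total revenue while the query counts cover U.S. searches only, so this estimate is an upper bound on the per-query number rather than a precise figure.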
New Google Personalized Search
February 22, 2007
Recently, Google has been pushing personalized search. They now have
a personalized homepage, search history, and personalized search
results. I tried the personalized search, and it is not clear whether
the results for a specific query have been personalized or not. I
think this is one aspect that Google can improve, i.e., informing
each user when personalization happens and which results are
personalized. About this, Marissa Mayer said in an interview:
One thing that we've
struggled with is if we should actually mark the results are
entering the page as a result of personalization but because
team is currently and frequently doing experiments, we
didn't want to settle on a particular model or marker at
this exact moment.
Marissa Mayer, a VP at Google, also said in the interview:
The actual implementation
of personalized search is that as many as two pages of
content, that are personalized to you, could be lifted onto
the first page and I believe they never displace the first
result, because that's a level of relevance that we feel
comfortable with. So right now, at least eight of the
results on your first page will be generic, vanilla Google
results for that query and only up to two of them will be
results from the personalized algorithm. I think the other
thing to remember is, even when personalization happens and
lifts those two results onto the page, for most users it
happens one out of every five times.
I like the idea of combining personalized search results with
generic search results. In my thesis, I proposed progressive
personalization. When the search engine is not confident about the
user's intention, it can present generic results and at least must
not annoy the user by pushing unrelated personalized results; when
the search engine is confident about the user's intention, it can
push personalized results to the user.
In summary, Google is pushing personalized search in a conservative
way.
Google and Kaltix [tag: Kaltix]
February 21, 2007
Besides Outride, Google acquired Kaltix in September 2003. Here is
the press release from Google and an article from CNET about Kaltix
in August 2003. Kaltix has three founders, and they may be Taher
Haveliwala, Sepandar Kamvar, and Glen Jeh. They co-authored a paper
giving an analytical comparison of approaches to personalized
PageRank.
Before that, each of them had a first-author publication related to
personalized PageRank.
Haveliwala: Topic-Sensitive PageRank, WWW02;
Jeh: Scaling Personalized Web Search, WWW03;
Kamvar: Extrapolation Methods for Accelerating PageRank Computations, WWW03.
Recently, Professor Junghoo Cho from UCLA published related work:
Automatic Identification of User Interest for Personalized Search,
WWW06. His work incorporates implicit feedback into PageRank.
Google and Outride [tag: Outride]
February 20, 2007
Recently, Google introduced more personalization technology on its
website, which I will review later. But back in September 2001,
Google had already acquired Outride, a startup doing personalized
search. Outride was a spinoff of Xerox PARC (just recently, Xerox
PARC made a deal with the search engine startup Powerset to do
natural language search).
Outride was founded by Jim Pitkow, Hinrich Schutze, and Todd Cass.
It was one of the earliest systems doing personalized search. The
most relevant publication about Outride is an article in
Communications of the ACM. From the article, I can see that Outride
also does personalization at the client side and uses query
augmentation and result reranking techniques. It looks like they
implemented a web browser plug-in (a sidebar), similar to a toolbar.
The paper does not reveal many technical details.
UCAIR emphasizes eager feedback, i.e., whenever the user interacts
with the retrieval system, such as by selecting a web page, the
system can respond immediately, e.g., by updating the user model.
UCAIR is based on a decision-theoretic framework and
context-sensitive statistical language models.
Haveliwala's Topic-Sensitive PageRank [tag: PageRank, Topic-Sensitive Retrieval]
February 18, 2007
I reviewed Haveliwala's Topic-Sensitive PageRank paper, which was
the best student paper at WWW 2002. This work is one of the early
research efforts on personalized search based on the PageRank
algorithm. I think it is really solid work. The author used the
Stanford WebBase crawler to crawl a part of the Web, and used ODP to
build a personalization vector and a probability distribution of
query words given each topic. The author used the overlap rate and a
variant of the Kendall distance as the evaluation metrics. Besides
that, the author also conducted a user study to evaluate the
performance of topic-sensitive PageRank. In the end, the author also
mentioned some potentially interesting problems and directions for
personalized search, such as privacy and the discovery of query
context.
The idea is to compute a list of PageRank scores (instead of a
single PageRank score) for each web page, i.e., for each topic, there
is a PageRank score for each web page. These topic-sensitive PageRank
scores can be computed from the web graph and the topic
classification of web pages in the ODP data. Then for each user
query, the search engine computes the probability distribution of
topics for this query and computes a weighted sum of the page's
topic-specific PageRank scores (the weight of each topic is the
probability of that topic given the query) as the final rank score.
For the probability distribution of topics for each query, the
search engine can check the query words and get the distribution
directly.
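The query-time combination described above can be sketched in a few lines of Python. This is a toy illustration of my reading of the idea, not the paper's code: the topics, pages, probabilities, PageRank values, and the smoothing constant for unseen terms are all made up.

```python
# Toy data (hypothetical): P(term | topic), topic priors, and one
# precomputed PageRank vector per topic.
TERM_PROB = {
    "Computers": {"java": 0.02, "linux": 0.03},
    "Recreation": {"java": 0.005, "coffee": 0.02},
}
TOPIC_PRIOR = {"Computers": 0.5, "Recreation": 0.5}
TOPIC_PAGERANK = {
    "Computers": {"java.example": 0.04, "coffee.example": 0.001},
    "Recreation": {"java.example": 0.002, "coffee.example": 0.03},
}
UNSEEN = 1e-6  # assumed smoothing value for terms unseen in a topic

def topic_probabilities(query_terms):
    """P(topic | query) by Bayes' rule with a unigram model, normalized."""
    scores = {}
    for topic, prior in TOPIC_PRIOR.items():
        p = prior
        for term in query_terms:
            p *= TERM_PROB[topic].get(term, UNSEEN)
        scores[topic] = p
    total = sum(scores.values())
    return {topic: s / total for topic, s in scores.items()}

def query_sensitive_score(page, topic_probs):
    """Weighted sum of topic-specific PageRank scores for one page."""
    return sum(p * TOPIC_PAGERANK[topic].get(page, 0.0)
               for topic, p in topic_probs.items())

probs = topic_probabilities(["java"])
for page in ("java.example", "coffee.example"):
    print(page, round(query_sensitive_score(page, probs), 4))
```

The expensive part (one PageRank computation per topic) happens offline; at query time only the topic distribution and a short weighted sum are needed, which is why the method fits naturally at the server side.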
The search engine can also compute the probability distribution from
the query together with its context. The author conducted a user
study (5 users, each of whom did 10 queries).
This work is done at the server side and can be applied directly in
a search engine. But it cannot be applied directly at the client
side, since a client-side search agent does not have the web graph.
The topic selection is coarse-grained, since it only uses the
top-level ODP topic categories. We could also have a topic category
for each individual person.
Two Talks about Search Security [tag: Privacy]
January 28, 2007
There are two talks related to search or search personalization.
One is a talk about search privacy, by Dr. Lorrie Faith Cranor, a
professor at CMU.
The other is a talk titled Secure Personalization: Towards
Trustworthy Recommender Systems, by Dr. Bamshad Mobasher, a
professor at DePaul.
Sunset of Findory [tag: Findory, Personalized News]
January 27, 2007
Today, I learned that Findory, a personalized news website, “rides
into the sunset”. It is sad news. But I believe that personalization
technology will succeed somewhere in real-world applications.
A Talk about Privacy-Enhanced Personalization [tag: Privacy-Enhancing]
January 26, 2007
I found that there is a talk by Dr. Alfred Kobsa, a professor at
UCI. The title of the talk is Privacy-Enhanced Personalization. It
should be very relevant to my thesis research on privacy-preserving
personalized search.
Susan Dumais' Personalized Search Talk at Yahoo! Research [tag: personalized search, Stuff I've Seen, implicit query, Phlat]
January 07, 2007
From Greg Linden's blog, I learned that Susan Dumais gave a talk on
personalized search at Yahoo! Research. Video of the talk is
available at Yahoo! Video. Susan will also come to town on March 26,
2007 to give a talk on Information Retrieval in Context.
Update on January 8, 2007: Susan Dumais was named an ACM Fellow for
her work on information retrieval and human-computer interaction. In
recent years, Susan has done a lot of research on personalized search
and has had several influential projects such as Stuff I've Seen,
Implicit Query, and Phlat. In the Class of 2006, there are three
fellows doing information retrieval research. Besides Susan, there
are C. Lee Giles (CiteSeer) and Peter Norvig (Google). There are also
quite a few fellows doing database and data mining research.
ACM Recommendation Policy on Privacy [tag: privacy, search engine log]
January 05, 2007
In June 2006, USACM published a recommended policy on privacy on the
ACM website (visit http://www.acm.org/usacm/Issues/Privacy.htm for
the content). To strike a balance between individual privacy
protection and valid governmental and commercial usage, ACM
recommends minimization, consent, openness, access, accuracy,
security, and accountability.
In August 2006, there was the AOL search log incident. The search
engine has now become an indispensable tool in people's daily lives.
However, many people may not be aware that search engines store a
lot of personal information and can potentially reveal a wide range
of details about individuals' private lives, such as medical history
and hobbies. Compared with the ACM recommendations, I think search
engine companies have a long way to go. For example, people
currently have virtually no access to search engine logs, although
no personal identity is stored at the search engine side. Moreover,
the search engine logs are probably stored on search engine data
servers indefinitely.
Some search engines, such as Google Personalized, have implemented
personalized search functionality and provide some interfaces for
the user to modify these data. For example, Google lets users delete
search history entries one by one. But this is still not convenient
for users. For example, users cannot remove several entries in a
batch.
Some Statistics Related to Web Search [tag: statistics, search engine, query, monetization]
December 26, 2006
Number of Indexed Web Pages
A couple of years ago, search engines competed on how many web pages
they indexed. They kept putting larger and larger numbers on their
home pages, and sometimes a party that reported a smaller number
argued that the other parties overestimated their numbers or used
different methods of calculation. Recently, Google removed this
number from its home page. Later, Yahoo! and MSN followed. It seems
that the number of indexed web pages is not so interesting any more.
Many people estimate that there are tens of billions of web pages on
the "surface web" and far more hidden web pages from searchable
databases in the "deep web".
Number of Queries
Instead of waging a war over the number of indexed web pages, search
engines currently compete on how many queries users submit to them,
which is directly related to a company's revenue. Some Internet
media research companies report these numbers. The most frequently
quoted numbers are from Nielsen//NetRatings and comScore.
Nielsen//NetRatings has a monthly report on the query shares of
search engines. In November 2006, an estimated 6.2 billion queries
were conducted at U.S. search engines. Google is at the top with 3.1
billion queries (49.5% share). It is followed by Yahoo! (24.3%), MSN
(8.2%), AOL (6.2%), and Ask (2.6%). Monthly estimates of U.S. search
engine queries in the second half of 2006 by Nielsen//NetRatings are
as follows.
October: 6.0 billion; September: 5.6
billion; August: 6.0 billion; July: 5.6 billion; June: 5.4 billion;
June: 5.7 billion.
Google consistently takes 50% share of
U.S. search queries. Yahoo! is around 25%~30% and MSN is around 8%~10%.
Monetization of Queries
Although the number of search queries conducted is directly related
to the money that search engine companies can make, it is not
proportional to ad revenue. Ad revenue also depends on the
advertisement auction and placement system. For example, although
Yahoo!'s search share is about 1/2 of Google's, Yahoo!'s ad revenue
is only about 1/4 of Google's. According to the estimate of Caris &
Co. analyst Tim Boyd,
"Yahoo made on average between 10¢
and 11¢ per search in 2006, bringing in a total of $1.61 billion for
the first nine months of the year. Google, meanwhile, makes between
19¢ and 21¢ per search. As a result, it made an estimated $4.99
billion during the same period." (Quoted from
an article of BusinessWeek)
We can see that, on average, each submitted query makes 20 cents for
Google and only 10 cents for Yahoo!
Collarity, a Startup of Personalized Search [tag: startup,
personalization, industry]
November 26, 2006
From the Venture Beat blog, I learned that there is another startup
doing personalized search: Collarity. There are already quite a few
startups doing personalized search, such as Surf Canyon.
I tried this version after registering an account. There is a slider
called the relevance compass, which lets individual users
continuously tune the search results from the fully personalized
level through the community level to the whole-population level.
This implementation is the same as what Microsoft researcher Susan
Dumais did for personalized search with her former intern Jaime
Teevan. After trying some queries and clicking some results, I could
not see how my search results got personalized. Maybe the company is
still at an early stage. Some different suggested terms appear at the
bottom of the compass when users move the slider. But the speed is
slow, and a suggested term has to be selected by the user before it
is added to the query. Here is a paragraph from Venture Beat about
Collarity.
"Levy Cohen, chief executive of
Palo Alto-based Collarity, said he got his idea to launch Collarity
because it bothered him that Google returns the exact same results
to people even if they have different interests. If you’ve searched
for information on Linux before, then the search engine should
return results relevant to open source, he said. Moreover, if you
search for “Java,” the search engine should know whether you’re more
likely interested in the computer language, or coffee."
-Venture Beat
Collarity claims to use the search results of people "like you" to
personalize the search results. However, I imagine that by using
other similar users' interests, we can at most get community/group
level personalization. If we really want personal level
personalization, we should use the user's own profile. The idea
behind Collarity is collaborative filtering, which is used
extensively in recommendation systems such as those at Amazon.com
and Netflix. But most personalization research in academia focuses
on exploiting the user's own profile. On the other hand, we may
combine these two ideas (i.e., item-based and user-based).
One comment mentions that Collarity is similar to a Yahoo! Research
demo, Mindset. I find that the Mindset interface also uses a slider,
which varies the results from shopping to research.
Andrei Broder's Information Supply Talk [tag: personalization, privacy, implicit query]
November 25, 2006
From the blog Geeking with Greg, I learned about the talk by Andrei
Broder on Information Supply. The slides are available here.
In the slides, Andrei Broder expresses his opinion about
next-generation web search. In his mind, Information Supply should
be the next step after Information Retrieval. He mentions that a
search engine can infer the user's information need and provide
relevant information even without an explicit query from the user.
Actually, some research by Susan Dumais and Mary Czerwinski on
Implicit Query goes in this direction.
I think Andrei Broder's information supply vision matches the
contextual search/personalized search vision. We need to infer the
user's information need and understand the user's real intention so
that we can get better search results. Currently, the user can
easily find satisfactory results on the Web for tasks such as
finding the homepage of a person or a company. However, for many
other search tasks, searchers cannot find a satisfactory answer. We
need to do research on improving the user's search experience or
information seeking/acquisition experience.
Andrei Broder gives some general ideas about how information supply
should work. However, he does not name concrete problems we need to
attack. Here are some problems I think we will face.
1) What kinds of information seeking activities can personalized
search help? I do not think personalized search can help every
search. For some search tasks, personalization can even degrade the
search experience because of imprecise user modeling. Maybe
personalization should target the difficult information seeking
activities.
2) How should the privacy issue be dealt with? Privacy is a big
concern in personalized search because a lot of personal information
will be disclosed and can potentially be abused. We need to study
how different levels of privacy protection can fit individual users'
acceptable privacy levels, how the personalization software
architecture should be chosen, and how we can implement
personalization systems to guarantee the appropriate level of
privacy protection.
3) How should personalized search interact with the user? The user
may not be willing to actively participate in the personalized
search process. In such cases, we need to consider how to do
personalized search in an implicit way. If the user is willing to
contribute to personalized search, we need to think of a way to get
the user involved. Moreover, how should we design the user interface
so that the user understands how personalized search works, instead
of assuming the user simply accepts the black-box magic of
personalized search? How should we design the personalized search
interface to facilitate the personalization process?
Some other questions have been proposed in
previous blog entries.
Vertical Personalized Search [tag: vertical search,
healthcare, law, Healia]
November 21, 2006
I talked with a researcher about personalized information management
in the healthcare domain. Contextual search is considered a
promising way to improve information seeking for practitioners in a
specific domain. It is interesting to see that vertical personalized
search, i.e., personalized search in a specific domain, has received
a lot of attention. For example, Healia is a startup that provides a
personalized information retrieval service in the health domain.
So far, I know of two domains that are interested in personalized
search: law and healthcare. In both domains, people have to look for
a needle in a haystack, and they care enough about finding relevant
information to interact with the retrieval system for many
iterations on a single information need, which gives the
personalized search algorithm the opportunity to gather enough
information about the user's intention.
However, I also have concerns about the feasibility of applying
personalized search in these domains. For example, I met a
researcher at a legal information system company who complained that
lawyers did not want to try the personalized search prototypes
because of privacy concerns. So I also wonder what doctors think
about personalized search. In order to apply personalized search in
a specific domain, we may need to do a survey to investigate whether
the people in this domain really like and accept the idea.
But I think vertical personalized search will become more popular in
the future, and not only in the healthcare or law domains.
local.live.com expires in year 4001 [tag: privacy, cookie]
September 05, 2006
I checked the cookies in my web browser, Firefox, and found that one
cookie from local.live.com has the following attributes.
Name: SerializationVersion
Content: 2
Host: local.live.com
Path: /
Send for: Any type of connection
Expires: Thursday, February 15, 4001 11:59:00 PM
Can we imagine what the world will be like in the year 4001?
I checked the cookies of many websites and found that it is common
for cookie expiration dates to be set far beyond the death of my
laptop: year 2011 for mail.google.com, year 2016 for microsoft.com,
year 2036 for amazon.com, year 2037 for yahoo.com....
Here is some information about Internet cookies:
http://webmaster.info.aol.com/aboutcookies.html.
We need to think seriously about the privacy and security of
Internet browsing behavior now. The same goes for Internet search
activities. We need to care about user privacy in personalized
search too.
How Was the AOL Searcher No. 4417749 Identified? [tag: privacy, search engine log]
August 24, 2006
There is a NY Times report from August 9, 2006 titled A Face Is
Exposed for AOL Searcher No. 4417749. A lady in Georgia was
identified, and a photo of her was put on the NY Times website too.
Here is how her identity was discovered. Searcher No. 4417749
searched for "landscapers in Lilburn, Ga", "homes sold in shadow
lake subdivision gwinnett county georgia", "retirement communities
for single women", and, multiple times, "eugene oregon jaylene
arnold" or "jarrett t. arnold". An investigator, maybe a reporter,
came to the town of Lilburn, GA and checked several people named
Arnold. Thelma Arnold, a 62-year-old widow who lives in Lilburn,
then said "Those are my searches," after the reporter read part of
the list to her.
Privacy is a serious issue in personalized search research. Putting
personalized search on the client side can alleviate the privacy
concern.
Update: On Eric Selberg's blog (08/09/2006 entry), there is a link
to DexOnline (an online phone book), which lists 25 Arnolds in
Lilburn, GA. He suspected that Ms. Arnold was tracked down using
high-tech means such as calling all the Arnolds in Lilburn, GA.
Partnership of Yahoo and eBay [tag: Web2.0, industry]
May 25, 2006
In industry, news of the partnership between Yahoo and eBay boosted
the share prices of both companies. Many people think it is a
win-win situation. I think so too. Both eBay and Yahoo really need
some good news to boost investor confidence. Google is eating away
at Yahoo's search share. GBuy and Google Base are threats to eBay
and PayPal.
From the technology perspective, Yahoo is now moving in the social
media direction. For Yahoo, one potential advantage of partnering
with eBay is eBay's huge user base. Moreover, many eBay users are
very serious and loyal. Like MySpace and Facebook, Yahoo! can build
a big social network based on the shoppers and businessmen of eBay.
Yahoo can also provide personalized search and recommendation
services to eBay users.
Personalization and Web 2.0 [tag: Web2.0]
May 22, 2006
Web 2.0 is hot. O'Reilly believes that one important feature of Web
2.0 is collective intelligence. I consider collective intelligence
to be the same thing as manpower or mass collaboration.
Does personalization belong to Web 2.0? In my opinion, it does not
in the narrow sense, since personalization technology does not
necessarily utilize collective intelligence. However,
personalization is strongly related to recommendation systems,
collaborative filtering, and social networks, which belong to Web
2.0. Thus it belongs to Web 2.0 in the broad sense.
PIM Workshop of SIGIR 2006 [tag: personal information management, UCAIR, SIGIR]
May 21, 2006
At SIGIR 2006, there is a two-day workshop on Personal Information
Management (PIM). We submitted a paper about capturing and
exploiting personal search history to improve retrieval accuracy.
Here is the abstract of the submission.
Personal search history is an
important type of personal information that is critical for learning
a user's interests and information needs
and can be exploited to improve the search service for a user. In
this paper, we describe our recent work on User-Centered Adaptive
Information Retrieval (UCAIR), which aims at capturing personal
search history with a client-side search agent and exploiting the
history information to help a user optimize search results.
We propose a decision theoretic framework and develop techniques for
implicit user modeling based on a user's personal search
history. We propose several context-sensitive retrieval
algorithms based on statistical language models to combine the
personal search history with the current query for better ranking of
documents. Using these techniques, we have developed an intelligent
client-side web search agent, i.e., the UCAIR search agent, which
can automatically capture a user's personal search history, store it
in XML format on the local disk, and exploit it to provide
personalized search.
Watson Commercialized [tag: Watson, industry]
December 12, 2005
Today, I read an article in the Chicago Tribune (free registration)
about the software Watson. Watson has been commercialized after a
quiet period. There are two academic papers about the Watson
project, the IUI 2001 paper and the JASIS 1999 paper, both of which
are coauthored by Jay Budzik and Kristian Hammond. It is interesting
to see that this academic project got commercialized.
I have not tried to install Watson, although it is free. It looks a
lot like Google Desktop Search, and I have already installed Google
Desktop Search on my laptop. From the research point of view, I have
not found any new feature provided by Watson in the demo on the
website so far.
Back Button of Web Browser in Personalization [tag: UCAIR, web browser]
December 11, 2005
The UCAIR toolbar changes the semantics of the web browser's Back
button. Using Internet Explorer with the UCAIR toolbar, when the
user clicks one result on the search result page and then clicks the
Back button, the user will see different contents on the search
result page. This is because the UCAIR personalized search agent
updates the user model immediately after the user takes an action
(clicking a result link) and reranks the search results according to
the updated user model. So the user will see a reranked search
result page, which is probably different from the page the user saw
before. Thus the semantics of the Back button change after the
installation of the UCAIR toolbar.
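To show why the page behind the Back button changes, here is a toy sketch of eager feedback in Python. It is not the actual UCAIR algorithm: the term-count user model, the overlap score, the choice to rerank only the results below the clicked one, and all names and data are simplifying assumptions for illustration.

```python
from collections import Counter

def rerank_after_click(results, clicked_index, user_model):
    """Update a term-count user model from the clicked summary, then
    rerank the results ranked below the clicked one (assumed policy)."""
    user_model.update(results[clicked_index]["summary"].lower().split())

    def score(result):
        # Overlap between the user model and the result summary.
        terms = set(result["summary"].lower().split())
        return sum(user_model[t] for t in terms)

    seen = results[: clicked_index + 1]
    # sorted() is stable, so ties keep their original order.
    unseen = sorted(results[clicked_index + 1:], key=score, reverse=True)
    return seen + unseen

# Hypothetical result list for the query "jaguar".
results = [
    {"id": "r1", "summary": "jaguar car dealer"},
    {"id": "r2", "summary": "jaguar animal habitat"},
    {"id": "r3", "summary": "jaguar car review price"},
]
model = Counter()
page_after_back = rerank_after_click(results, 0, model)
print([r["id"] for r in page_after_back])
```

After the click on the car dealer result, the car review moves above the animal page, so the list the user returns to is no longer the list the user left.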
During several demos of the UCAIR toolbar, many people were
interested in the change in the semantics of the Back button. A lady
said she would like to see the same stuff as before after clicking
the Back button. Some people were interested in how to minimize the
confusion that this change brings to the user, for example where the
pushed-up results should be placed if the UCAIR toolbar has to
change the semantics of the Back button.
I found that breaking the Back button was considered one of the top
web design mistakes by Jakob Nielsen in 1999. The semantics of the
Back button are an open question for web design now, especially with
dynamic web design techniques such as Ajax. What does the user
expect when he clicks the Back button? The answer will probably not
be consistent. There is some research on the web browser's Back
button, such as Getting Back to Back by Saul Greenberg and Andy
Cockburn.
Personalization and Privacy [tag: privacy]
December 10, 2005
There is a book, Make It Personal, which is about personalization,
privacy, and profit. Here is this book's Amazon link. The author is
Bruce Kasanoff. The book talks about how to do one-to-one marketing
without invading privacy. There are some good reviews of this book
on Amazon, especially the review by Peter Leerskov. It looks like a
good e-commerce book. Personalization is still a buzzword in
e-commerce. We can easily see that many websites claim to be
personalized websites.
Personalized search has also recently been a very active research
area in the ACM SIGIR community and the search engine industry.
Privacy is a companion word of personalization, although industry
seems to take this problem much more seriously than academia (I know
the ACM SIGMOD community is doing a lot of research on database
privacy).
There are some bills about privacy. EPIC (Electronic Privacy
Information Center) is a good resource on online privacy, including
the bill track, where you can find privacy-related bills passed by
the 105th-109th Congresses.
A Discussion about Personalization [tag: Vivisimo, industry]
December 6, 2005
A long time ago, I mentioned the Vivisimo CEO's comments about
personalization. I just found that Greg Linden has a post about it
on his blog, with some interesting follow-up comments.
Again, I generally disagree with the "dead end" viewpoint. But we
need to do solid work to demonstrate the advantage of
personalization technology.
Implicit Feedback, Pseudo Feedback, Relevance Feedback and Active Feedback - UCAIR (14) [tag: implicit feedback, pseudo feedback and relevance feedback, active feedback]
October 21, 2005
Implicit feedback is a popular way to do personalized search. But a
general audience may confuse it with pseudo feedback and relevance
feedback. So it is worth clarifying the differences here.
Relevance feedback in information
retrieval research was proposed in the 1970's by Gerald Salton and his
co-workers as a way to improve retrieval accuracy. Relevance feedback
works in the following way. After the user submits a query, the
retrieval system will do the first run to rank documents and then
present a few top ranked documents for the user to explicitly judge the
relevance. After getting the user's relevance judgments on these documents,
the retrieval system combines the judged documents with the
original query through query expansion, does a second run, and presents
the newly ranked documents to the user. Many empirical
evaluations show that relevance feedback is an effective way to improve
retrieval accuracy. The Rocchio feedback formula is the most popular
formula for relevance feedback in the vector space model. Model-based
feedback, proposed by ChengXiang Zhai in his CIKM 2001 paper,
is a popular way to do relevance feedback with statistical language
models.
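To make the Rocchio idea concrete, here is a minimal sketch in Python. It assumes queries and documents are sparse term-weight dictionaries; the function name `rocchio` and the default values of `alpha`, `beta` and `gamma` are only illustrative choices, not values taken from any particular system.

```python
from collections import Counter


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an expanded query as a dict of term -> weight.

    query: dict term -> weight for the original query.
    relevant / nonrelevant: lists of dicts (judged document vectors).
    """
    new_query = Counter()
    # Keep the original query, scaled by alpha.
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Move toward the centroid of the documents judged relevant.
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    # Move away from the centroid of the documents judged non-relevant.
    for doc in nonrelevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrelevant)
    # Terms that end up with a non-positive weight are dropped.
    return {t: w for t, w in new_query.items() if w > 0}


expanded = rocchio(
    {"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "car": 2.0}],
    nonrelevant=[{"jaguar": 1.0, "cat": 1.0}],
)
```

In this toy example the judged-relevant document pulls "car" into the query, while "cat" only appears in the non-relevant document and is dropped.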
However, in many retrieval tasks such as
web search, the user is not willing to provide the relevance feedback to
the retrieval system. So pseudo feedback was later proposed. Pseudo
feedback works in the following way. After the user submits a query, the
retrieval system will do the first run to rank documents and pick a few
top ranked documents. These top ranked documents are assumed to be
relevant by the retrieval system and are combined with the original
query through query expansion to do the second run. The
retrieval system presents the newly ranked documents to the user. Here we
can clearly see that relevance feedback needs user involvement in the
relevance judgment process while pseudo feedback does not. A lot of
empirical evaluations show that pseudo feedback generally, but not
always, can outperform the baseline retrieval. However, pseudo feedback
is not as effective as relevance feedback.
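The two-run process of pseudo feedback can be sketched as follows. This is only a toy illustration: the term-overlap `score` function, the parameters `k`, `n_terms` and `beta`, and the tiny document collection are all my assumptions, and a real system would use a proper retrieval model.

```python
from collections import Counter


def score(query, doc_terms):
    """Toy scorer: weighted term frequency overlap between query and document."""
    tf = Counter(doc_terms)
    return sum(weight * tf[term] for term, weight in query.items())


def rank(query, docs):
    """Rank document ids by score, best first (ties keep insertion order)."""
    return sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)


def pseudo_feedback(query, docs, k=2, n_terms=2, beta=0.5):
    """First run, assume the top k documents are relevant, expand, second run."""
    first_run = rank(query, docs)
    pooled = Counter()
    for doc_id in first_run[:k]:
        pooled.update(docs[doc_id])
    expanded = dict(query)
    added = 0
    for term, count in pooled.most_common():
        if term in expanded:
            continue
        if added == n_terms:
            break
        expanded[term] = beta * count / k  # average count in the top k documents
        added += 1
    return rank(expanded, docs), expanded


docs = {
    "d1": "jaguar car engine car".split(),
    "d2": "jaguar car dealer".split(),
    "d3": "jaguar cat jungle".split(),
    "d4": "car engine repair".split(),
}
second_run, expanded = pseudo_feedback({"jaguar": 1.0}, docs)
```

Note that no user judgment appears anywhere in the code. If the top ranked documents of the first run happen to be about the wrong sense of "jaguar", the expansion terms will be wrong too, which is one reason pseudo feedback does not always beat the baseline.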
Relevance feedback is not applicable in
many search activities, while pseudo feedback totally excludes the
user from the feedback process. So both relevance feedback and pseudo
feedback have limitations. In interactive information retrieval such as
web search, the user generally has many interactions with the retrieval system. During these
interactions, the user gives a lot of hints to the retrieval system,
which can help the retrieval system infer the user's information need
better. Thus implicit feedback was proposed. Implicit
feedback works in the following way. The retrieval system will store
user interaction data such as query and clickthrough history, infer
the user's information need better through these interaction data, compose the
new query to rank documents and present ranked documents to the user. We
can see that implicit feedback neither asks for the user's explicit relevance
judgment nor categorically assumes that top ranked documents of
baseline retrieval are relevant. Instead, implicit feedback
intelligently infers the user's information need through those hints
implicitly provided by the user. However, there is a caveat for
implicit feedback. We need to analyze those hints carefully so that we do not
incorporate noise into the new query, which may even hurt the retrieval
performance. Read the paper
Context-Sensitive Information Retrieval Using
Implicit Feedback for
more discussion and references.
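As a rough illustration of the idea (not the models used in the paper above), the sketch below builds a new query from the current query plus the session history. The data layout, the `history_weight` parameter and the `min_count` noise filter are assumptions I made up for this example.

```python
from collections import Counter


def implicit_feedback_query(current_query, history, history_weight=0.5, min_count=2):
    """Compose a new weighted query from the current query and session history.

    current_query: list of query terms.
    history: list of (past_query_terms, clicked_snippets), where clicked_snippets
             is a list of term lists for the results the user clicked.
    """
    evidence = Counter()
    for past_query, clicked_snippets in history:
        evidence.update(past_query)
        for snippet in clicked_snippets:
            evidence.update(snippet)
    new_query = {term: 1.0 for term in current_query}
    total = sum(evidence.values()) or 1
    for term, count in evidence.items():
        if count < min_count:
            # Crude noise filter: ignore terms seen only once in the history.
            continue
        new_query[term] = new_query.get(term, 0.0) + history_weight * count / total
    return new_query


history = [
    (["java", "island"], [["java", "island", "indonesia", "travel"]]),
    (["java", "travel"], [["java", "travel", "guide"]]),
]
new_query = implicit_feedback_query(["java", "hotel"], history)
```

Here the earlier queries and clicked snippets suggest that "java" means the island, so "island" and "travel" are added to the query with small weights, while terms seen only once are treated as noise.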
To summarize the differences among these
three feedback techniques, relevance feedback asks the user for explicit
relevance judgments; pseudo feedback assumes the top ranked documents of
baseline retrieval are relevant; implicit feedback tries to better infer
the user's information need through the data implicitly provided by the
user.
Active feedback was proposed in the paper
Active Feedback in Ad-hoc Information Retrieval.
Active feedback can be considered a kind of relevance feedback. But
traditional relevance feedback focuses on how to incorporate judged
documents into the new query (e.g., query term addition and query term reweighting),
while active feedback studies which documents should be presented to the
user for relevance judgment in order to maximize what
the retrieval system learns from the user's judgments. A general framework was proposed in the paper and several specific
algorithms were derived from the framework.
Motivation for Personalized Search - UCAIR (13) [tag: difficult
query, UCAIR]
October 20, 2005
In research papers or presentations,
people often use ambiguous queries for the motivation of contextual or
personalized search. Often used ambiguous query examples are
"bass" (fish or instrument), "java" (programming language, island or
coffee), "jaguar" (animal, car and Apple software) and "IR application"
(Infrared application or Information Retrieval application).
These ambiguous queries are really
one motivation for contextual search. However, the motivation of
contextual search is not limited to the query disambiguation. In my
SIGIR 2005 paper, I showed that for 30
hard topics selected from TREC (Text REtrieval Conference) topics 1-150,
the search needs to be put in context. These topics are called
hard topics because previous experiments show that they have very poor
retrieval performance using traditional retrieval algorithms. When I
looked through these hard topics, I could see that most of them are hard not
because they are ambiguous. Instead, these topics are inherently hard
because 1) it is very hard for the user to specify the information
need clearly since the description of these topics is very
complex; 2) it is very hard for the retrieval system to find relevant
documents since there are very few relevant documents in the huge document collection. We demonstrated that using context information
(query history and clickthrough data), we can improve retrieval
performance. Here is an example of those hard topics. Each TREC
topic is composed of topic number (unique ID), title, description, and
narrative.
<topic>
<number> 2
<title> Acquisitions
<desc> Document discusses a currently proposed acquisition involving
a U.S.
company and a foreign company.
<narr> To be relevant, a document must discuss a currently proposed
acquisition (which may or may not be identified by type, e.g.,
merger, buyout, leveraged buyout, hostile takeover, friendly
acquisition). The suitor and target must be identified by name; the
nationality of one of the companies must be identified as U.S. and
the nationality of the other company must be identified as NOT U.S.
</topic>
For this topic, the
description of information need is very complex and there are a lot of
constraints. Moreover, there are only 283 relevant documents in the
whole document collection (this TREC collection has 242,918 documents). Here is a real
query sequence (4 queries in a sequence) submitted by a single
user and the corresponding poor retrieval performance. MAP means
Mean Average Precision, which is a good (but not intuitive) measure of
overall retrieval performance, and Pr@20docs is the fraction of the top 20 documents that are relevant, which is a good measure of
web search performance since many users only care about
the relevance of top ranked results.
First query: acquisition u.s. foreign company
MAP: 0.004; Pr@20docs: 0.000
Second query: acquisition merge takeover u.s. foreign company
MAP: 0.026; Pr@20docs: 0.100
Third query: acquire merge foreign abroad international
MAP: 0.004; Pr@20docs: 0.050
Fourth query: acquire merge takeover foreign european japan
MAP: 0.027; Pr@20docs: 0.200
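For readers who are not familiar with these two measures, here is a small sketch of how they are usually computed. The function names are mine. For a single query, the "MAP" value is just the average precision of that query; MAP proper is the mean over a set of queries.

```python
def precision_at_k(ranked, relevant, k=20):
    """Fraction of the top k ranked documents that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k


def average_precision(ranked, relevant):
    """Average of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

With this definition, a Pr@20docs of 0.100 means that only 2 of the top 20 results were relevant, and 0.000 means that none were.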
To summarize, query disambiguation is
one motivation of contextual or
personalized search. However, it is not the only motivation. For information
seeking activities for hard topics, we also need to put the search
in context.
Two Patents about Search Engine Personalization - Industry Series (10)
[tag: patent, industry]
October 2, 2005
There are two patent applications
related to search engine personalization.
One is from Google,
Variable personalization of search results in a
search engine, which was
demonstrated somewhere on the Google website before, although it has
since disappeared. The basic idea is to have a slider button for the user to tune
the degree of personalization. Here is the abstract of the patent
application.
This invention would
enable a searcher to fill out a profile, perform a normal search, and
then use a slider button to indicate how much his or her personal
information from the profile should be used to modify (rerank) that
search based upon the personalization information that they have entered
into the profile, by sliding the button partially, or all the way to a
full influence on the results.
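The abstract does not say how the slider value is combined with the original ranking. One simple way to imagine it is a linear interpolation, sketched below. Everything in this sketch (the function name, the way the original rank is turned into a score, and the profile scores in [0, 1]) is my assumption, not the method in the patent application.

```python
def rerank_with_slider(results, profile_scores, slider):
    """Blend the original ranking with profile-based scores.

    results: list of document ids in the original search order.
    profile_scores: dict doc id -> match with the user profile, in [0, 1].
    slider: 0.0 (no personalization) to 1.0 (full personalization).
    """
    if not 0.0 <= slider <= 1.0:
        raise ValueError("slider must be between 0 and 1")
    n = len(results)

    def blended(item):
        position, doc = item
        base = 1.0 - position / n  # original rank mapped to (0, 1]
        personal = profile_scores.get(doc, 0.0)
        return (1.0 - slider) * base + slider * personal

    ranked = sorted(enumerate(results), key=blended, reverse=True)
    return [doc for _, doc in ranked]


results = ["a", "b", "c"]
profile = {"c": 1.0, "b": 0.2}
```

With the slider at 0 the original order is kept, with the slider at 1 the order is determined by the profile only, and values in between mix the two.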
The other is from
Yahoo!, Color Graphing and Personalization.
Here is the abstract of the patent
application.
In a search
processing system, identifying input authority weights for a plurality
of pages, wherein an input authority weight represents a user's weight
of a page in terms of interest; distributing a page's input authority
weight over one or more pages that are linked in a graph to the page;
and using a resulting authority weight for a page in effecting a search
result list. The search result list might comprise one or more of
reordering search hits and highlighting search hits.
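The abstract leaves the distribution rule open. As a guess at what "distributing a page's input authority weight over linked pages" could look like, the sketch below does a single propagation step in which each page keeps a fraction of its weight and splits the rest equally among the pages it links to. The `keep` parameter and the equal split are my assumptions.

```python
def distribute_authority(input_weights, links, keep=0.5):
    """One-step distribution of user-assigned authority weights over a link graph.

    input_weights: dict page -> weight the user assigned to the page.
    links: dict page -> list of pages linked to that page in the graph.
    keep: fraction of the weight that stays on the page itself.
    """
    result = {}
    for page, weight in input_weights.items():
        neighbors = links.get(page, [])
        if not neighbors:
            # Nothing to distribute to; the page keeps all of its weight.
            result[page] = result.get(page, 0.0) + weight
            continue
        result[page] = result.get(page, 0.0) + keep * weight
        share = (1.0 - keep) * weight / len(neighbors)
        for neighbor in neighbors:
            result[neighbor] = result.get(neighbor, 0.0) + share
    return result


weights = distribute_authority({"a": 1.0}, {"a": ["b", "c"]})
```

The resulting weights could then be used to reorder or highlight search hits, as the abstract describes.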
Some Discussion about Thorsten's ACM SIGIR 2005 Paper - SIGIR
Series (4) [tag: click bias, relative relevance]
October 1, 2005
Jakob Nielsen has an
article
about Thorsten's ACM SIGIR 2005 paper (see the
September 3
post for more information about this paper), which
has spurred some
discussion at the Cre8site Forum.
It is interesting to read the discussion about how to do user search
behavior research in an unbiased way and some research findings of this paper.
Vivisimo teams with MSN for FirstGov.gov - Industry Series (8)
[tag: Vivisimo, industry]
September 30, 2005
Vivisimo has teamed with MSN to provide the
search technology for the U.S. government's
FirstGov.gov portal, as reported
in the
09/26/2005 article of Search Engine
Watch. Compared with well-exposed Google activities,
which always attract media attention, even when the news is about the new business of Google's
ex-chef (see
Google to Noodles: A Chef Strikes Out on His Own
from the New York Times) and the hiring of
some new chefs (see
Wanted at Google: A few good chefs from News.com), the coverage of
this event is relatively minimal.
Vivisimo has interesting technologies
for clustering search engine results. Raul Valdes-Perez, CEO of Vivisimo, thinks
that personalization is a dead end and has written an
article about it, with which I generally
disagree. The problems he mentioned in the article
have been addressed or are being addressed in personalization research.
A New Version of UCAIR Toolbar - UCAIR Series (11) [tag: UCAIR]
September 22, 2005
There is a new version of UCAIR toolbar,
which can be downloaded from the
UCAIR project website. This version was
rewritten by Bin nearly from scratch. We redesigned the software
architecture of the UCAIR toolbar to make it extensible and robust.
A Seminar Course about Search Engines in SIMS, Berkeley - Academia Series
(1) [tag: seminar, Marti Hearst]
September 21, 2005
There is a seminar course (Search
Engines: Technology, Society, and Business) offered at
SIMS, Berkeley in fall 2005. The course website says, "A set of top-notch experts have agreed to
give lectures for fall 2005." Among them, Dr. Susan Dumais from Microsoft Research and Dr. Sepandar Kamvar (co-founder of Kaltix) from Google will give lectures.
Both of them work on personalized search, so their topics are probably related
to personalized search. The slides and videos for some talks are
available at the website.
Personalized Search Papers at ACM CIKM 2005 - CIKM 2005 Series (1) [tag:
CIKM, UCAIR, Y!Q]
September 20, 2005
CIKM 2005, one of the top information
retrieval research conferences, will be held in Bremen, Germany from October 31st to
November 5th. The last session of this conference is about context and
personalization. There will be three paper presentations in this session:
Context Modeling and Discovery Using Vector Space Bases
by Massimo Melucci (University of Padua)
Y!Q: Contextual Search at the Point of Inspiration by Reiner Kraft,
Farzin Maghoul, Chi Chao Chang (Yahoo! Inc.)
Implicit User Modeling for Personalized Search by Xuehua Shen, Bin Tan,
Chengxiang Zhai (CS, UIUC)
For the Y!Q paper, the
blog of the first author Reiner Kraft
explains the new feature of Y!Q. When you read a web page and are
interested in some phrases or a sentence, you can mark them and
trigger the search. Actually this functionality appeared in the defunct IntelliZap system (See
WWW 2001 paper).
Search Engine Web APIs - Industry Series (7) [tag: Web API, industry]
September 19, 2005
Google Web API provides
a way for programmers to develop interesting search related
applications utilizing the power of the Google search engine. But currently
there are some limitations for programmers who want to develop a large-scale
application. I notice that there are at least two limitations. One is that
one account can submit at most 1000 requests per day and the other is
that each request returns at most 10 search results.
With these two limitations, client-side programs cannot get many
results frequently from Google through the Google Web API and thus cannot do
much interesting
processing such as result reranking at a large scale.
Yahoo Web API permits 5000
queries per IP per day and 50 search results per query. So Yahoo Web API is
friendlier to developers. Meanwhile, MSN is also preparing to release their Web APIs (see news from
News.com). I hope the competition will
push all search engines to upgrade their Web APIs in the near future,
which will benefit developers and eventually end users.
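A quick back-of-the-envelope calculation shows how much these limits matter for client-side reranking. The quotas below are the ones quoted in this post. The assumption that deeper results are fetched by paging, one request per page, is mine.

```python
from math import ceil

# Quotas as quoted in this post (September 2005).
QUOTAS = {
    "google": {"requests_per_day": 1000, "results_per_request": 10},
    "yahoo": {"requests_per_day": 5000, "results_per_request": 50},
}


def max_results_per_day(api):
    """Upper bound on the number of results one account/IP can fetch per day."""
    quota = QUOTAS[api]
    return quota["requests_per_day"] * quota["results_per_request"]


def queries_per_day(api, depth):
    """User queries servable per day if each query needs `depth` results.

    Assumes deeper results are fetched by paging, one request per page.
    """
    quota = QUOTAS[api]
    requests_per_query = ceil(depth / quota["results_per_request"])
    return quota["requests_per_day"] // requests_per_query


print(max_results_per_day("google"))   # 10000
print(max_results_per_day("yahoo"))    # 250000
print(queries_per_day("google", 100))  # 100
print(queries_per_day("yahoo", 100))   # 2500
```

So a program that reranks the top 100 results could serve about 100 queries per day per Google account, versus about 2500 per day per IP with Yahoo.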
Notions of Personalization in Industry - UCAIR Series (10) [tag:
notion]
September 18, 2005 (China Mid-Autumn
Festival)
Besides personalized search engines in
industry, there are personalized portals and recommendation systems,
which are briefly discussed below.
Personalized Portal: My Yahoo is the
pioneer in the personalized web portal, which includes personalized
news, weather forecasts, comics, and TV listings. The user can customize
the personalized portal by choosing the content of interest, colors,
layout, etc. Findory is a web site which provides a
personalized news service. Unlike My Yahoo, the user does not need
to explicitly specify his or her interests. Instead, the web site implicitly
infers the user's interests from the user's interaction history on the
web site. The more browsing history is collected, the better the
selection of personalized news articles becomes.
Recommendation System: Many e-commerce web sites try to build a personalized store for each online customer.
Amazon is the most famous
one in building personalized web stores. They use collaborative
filtering techniques to recommend products to customers according to the products
those customers purchased or viewed before.
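To give a flavor of how such recommendations can be computed, here is a toy item-to-item collaborative filtering sketch based on cosine similarity between the sets of customers who bought each item. It is a simplified illustration with made-up data and function names, not Amazon's production algorithm.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(purchases):
    """purchases: dict customer -> set of items. Returns item -> {item: cosine}."""
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    sims = defaultdict(dict)
    for a in buyers:
        for b in buyers:
            if a == b:
                continue
            common = len(buyers[a] & buyers[b])
            if common:
                sims[a][b] = common / sqrt(len(buyers[a]) * len(buyers[b]))
    return sims


def recommend(purchases, customer, top_n=3):
    """Recommend items similar to what the customer already bought."""
    sims = item_similarities(purchases)
    owned = purchases[customer]
    scores = defaultdict(float)
    for item in owned:
        for other, similarity in sims[item].items():
            if other not in owned:
                scores[other] += similarity
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


purchases = {
    "ann": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "cat": {"book", "pen"},
    "dan": {"book"},
}
```

For "dan", who only bought the book, the lamp is recommended first because two of the other three book buyers also bought it. Note that the computation needs the purchase histories of all customers, which is why this kind of personalization lives on the server side.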
Notions of Personalization in Personalized Search Engine - UCAIR Series
(9) [tag: notion]
September 17, 2005
Web search engines have achieved great
successes in helping people find information on the Web, especially for
simple information need such as homepage finding. However, search
engines still perform poorly in many other tasks. There are many reasons
for the poor performance of the search engine. Among them, two
important reasons are frequently pointed out. First, many user queries
are ambiguous or the user himself does not know how to specify the
information need exactly. Thus the search engine cannot infer the real
user information need from the current user query alone. Second,
information retrieval is an interactive process; users will adjust their
queries during this process. Therefore, the search engine should also
adjust the inference of user information need. Nevertheless, currently
most, if not all, search engines use only the user's current query to do
the search. Some search engine companies such as Google, MSN Search and
Yahoo are trying to use contextual and personal information to help the search. Some search engines, such as Google, have already released a test
version of personalized search. Yahoo co-founder Jerry
Yang said that the relevance of search is still the Holy Grail for any
search application and the key challenge for Yahoo and all search
companies going forward will be to find ways to increase the
personalization of results, i.e., making sure that a user truly finds
what he or she is looking for when typing in a keyword search.
Notions of Personalization in Human-Computer Interaction Community - UCAIR Series (8) [tag:
HCI, interface]
September 16, 2005
Currently, there is much interest in the
personalization of product interfaces. For example, mobile phones are
now sold with replaceable colored covers, e-commerce sites learn a user's
preferences, and word processors allow you to customize the menus and
tool bars. In an
HCI 2000 poster, personalization is
defined as follows.
Personalization
is defined here as a process that changes the functionality, interface,
information
content, or distinctiveness of a system to increase its personal
relevance to an individual.
The motivation for personalization is divided into those that are
primarily to facilitate the work, e.g., bookmarking a web page, and those
that are primarily to accommodate social requirements, e.g., expressing
the identity of the user. The HCI community focuses on how to model user
search behavior, what kinds of user actions, such as mouse movements, are
related to user interests, and how the system can extract useful
information during the user's interaction with the system to do
personalization. In an
IUI 2004 paper, the author studied the
correlation between four mouse operations and user interests and used these mouse
operations as clues to extract context keywords for similarity
search.
Notions of Personalization in Information Retrieval Community - UCAIR Series (6) [tag:
information retrieval, contextual search]
September 14, 2005
Research in information retrieval has a
long history dating back to the 1950s. Over the decades, significant progress
has been made in developing retrieval models such as the vector space model,
the probabilistic model and, recently, the statistical language model, performing
large scale empirical evaluation and building useful systems such as
SMART, Lemur and Google. Nevertheless, almost all existing retrieval
models and systems can be characterized as "one size fits all". Only
user queries are used to represent the user's information need and there is no
representation of search context and user preference. Thus the same query
submitted by different users is treated in exactly the same way. A great
amount of responsibility of finding relevant information is taken by the
user. However, the ideal retrieval system should proactively incorporate
both the user's search context and personal preference into the
retrieval decision process. In a recent
workshop about challenges in
information retrieval and language modeling, personalization and contextual
search was considered one of two big challenges in information
retrieval. They define contextual search as follows.
Contextual Search: Combine search technologies and knowledge about
query and search
context into a single framework in order to provide the most
"appropriate" answer for a user's
information needs.
However, despite recent attention to
this problem, little progress has been made due to the difficulty of
capturing and representing knowledge about the user, context and task in
the general web search environment. Although there are many studies of
retrieval models (by researchers in computer science) and of user models
and the user information seeking process (by researchers in information
science), the research on user models and retrieval models is currently
not well integrated.
Participants of the workshop believe that the future search engine
should be able to collect use context and query features to infer
characteristics of the information need unobtrusively. A retrieval
framework integrating retrieval model and user model needs to be
proposed, studied and evaluated empirically.
Web Browser, Search Engine and Toolbar - Industry Series (6) [tag: web
browser, industry]
September 12, 2005
The Web Browser is the most important
window to the immense information on the Internet. There are
Internet Explorer (IE) (IE 7.0
is in beta testing),
Firefox
(more than 80 million downloads since its release on November 9, 2004),
Opera (it just celebrated its 10th anniversary on August 30, 2005),
Netscape (watch the drama of
browser war between Netscape/Firefox
and IE),
Safari, and
others. Developers can add new
functionality to the web browser through add-ins such as the Google toolbar.
The
Search Engine helps the user find
information on the Internet. Google, Yahoo and MSN are the dominant players in
the Search Engine arena (watch the drama of Microsoft's
suit against Google and Kai-Fu Lee).
All of them offer IE toolbars, which help the user search for information
without visiting the search engine homepage, and APIs, which help developers add new functionality
based on those search engines.
The UCAIR toolbar is an IE toolbar that uses Google search results as its basic results. But
so far, it does not make use of the Google APIs. However, that is an option
under consideration.
Information Sources of Search Engine Industry - Industry Series (5) [tag:
search engine watch, industry]
September 11, 2005
Search Engine Watch is a very popular
electronic daily that provides information about search technology. Their
staff also organizes the Search Engine Marketing conference several times
each year around the world. This is a primary place to read news about
search technology in industry. Each daily issue provides a lot of
interesting links to news about the search engine industry.
Search Engine Journal,
WebMaster
World,
Search Engine Show Down, and
Cre8site
are useful
electronic journals or forums to obtain news about search technology too.
Some interesting blogs are as follows,
Geeking with Greg
(with more technology flavor),
John Battelle's
Searchblog
(with more social flavor),
Google
Blog (official Google blog),
Yahoo Blog
(official Yahoo Search blog?),
and MSN
Search WebLog
(official MSN Search blog?).
Among these information bushes, we can
pick the personalized search berry, which is a hot area for most, if not
all, search engine companies.
Desktop Search Software and APIs - Industry Series (4) [tag: desktop
search]
September 10, 2005
Desktop Search software is the search
engine for the personal computer. There are
Google,
MSN,
Yahoo (or X1), and
Copernic
Desktop Search, all of which have free versions. In order to
let developers add functionality to the fledgling Desktop Search
software,
Google and
MSN provide APIs. The Google API
documentation looks comprehensive and there is a nice developer discussion
group.
Using Desktop Search APIs, information
retrieval
researchers have more power to do research on the personalized search on
the client side. They can have access to the index structure of user
local files and build a better user model.
UCAIR Personalized Search Toolbar - UCAIR Series (4) [tag: UCAIR]
September 9, 2005
In the UCAIR project, we are developing the
UCAIR Personalized Search Toolbar. The
software can be downloaded from the UCAIR project website. Currently, the UCAIR
Toolbar is an Internet Explorer plug-in and uses Google search results
as its basic results. But it is a matter of engineering to integrate
it with other web browsers such as Firefox or to use other search engines'
results, such as Yahoo search results.
Compared with personalization at the
search engine server side, personalization at the client side, as the UCAIR
toolbar does, has the following advantages. 1) Privacy is much less of a
concern. The user interaction history is strictly kept at the
client side and the search engine cannot store the information about
what you have viewed. 2) On the client side, there is much richer user
information than just keyword queries and clickthrough data, which can be
used to better infer the user model. For example, the user's local files
can be indexed to represent the user's information interests. 3) The
computation and storage cost will be reduced on the search engine side.
The disadvantage I can see for personalization at the client side is
that there is no global index of all web pages, so the client side
probably cannot control the general retrieval function.
comScore has a report about
Search Engine rating
in July 2005. Not surprisingly, Google maintains the lead with a
36.5% share of searches, followed by Yahoo (30.5%) and MSN (15.5%). But for
searches submitted from toolbars, Yahoo tops the share.
"Yahoo toolbars processed more than 282
million searches during the month, a 74-percent increase over the
previous year". A more interesting number related to
personalization at the client side is the ratio of searches submitted
from toolbars to all searches. "In
July, 11 percent of all domestic searches were conducted via
toolbars, up from 8 percent in July 2004." From this number, we
can see that indeed a lot of searches (11%) are submitted from toolbars.
If personalization functionality is added to the toolbar, I expect this
percentage to increase, since users should see more
relevant web pages returned to them when using a personalized search
toolbar [See our
CIKM 2005 paper for a user study about
personalized web search].
Findory Personalized News - Industry Series (2) [tag: Findory, Personalized News]
September 8, 2005
I learned about the Findory website
around a year ago. I was excited to see that Findory provides a
personalized news service. The idea behind Findory is similar to
that of Amazon, i.e., collaborative filtering in recommendation
systems. Actually, the founder of Findory (Greg
Linden) worked in the Amazon Personalization Group before and
co-authored the paper
Amazon.com Recommendations: Item-to-Item
Collaborative Filtering in IEEE Internet Computing, January
2003.
The collaborative filtering idea is
easily applied on the server side if privacy is not a big concern.
Amazon has achieved great success using the same idea. Moreover,
the more user interaction data (e.g., keyword queries and viewed web pages) are
collected, the better personalization it is supposed to achieve. However,
for personalization on the client side, which greatly reduces privacy
concerns, it is hard to get the interaction history of other people
and thus collaborative filtering techniques cannot be used in
personalized search on the client side. On the other hand,
personalization techniques at the client side can exploit much richer
user information, such as the user's desktop index, to infer a better user model
and thus improve retrieval accuracy.
Compared with Findory, Google News has
the customization functionality, i.e., the user can select keyword query
terms he likes, e.g., "Personalized Search" and then relevant news
articles searched will be put in a new category. Moreover, the layout of
different categories of Google News can be changed by the user. I am not
sure when Google Personalized Search, launched in late June, will be
integrated with Google News.
IRiX Workshop at ACM SIGIR 2005 - SIGIR Series (3) [tag: IRiX, SIGIR
2005]
September 6, 2005
The second
IRiX
(Information Retrieval in Context) Workshop at ACM SIGIR Conference was
held on August 19, 2005. At
SIGIR
2004 in Sheffield, UK, the first
IRiX
workshop was organized by
Peter Ingwersen,
Keith
van Rijsbergen, and
Nick Belkin. The
proceedings and
workshop report of the first IRiX
workshop are available at SIGIR Forum. The second IRiX workshop was
organized by Peter Ingwersen,
Kalervo Järvelin, and Nick Belkin. The
program and
proceedings are also available.
At the workshop, besides several
presentations, there were very interesting discussion groups. I joined
Group A, Context, Situation, and Task: Implications for IR, led
by Nick Belkin. We discussed what context and situation are and how they
relate to each other. After the discussion, we considered a situation to be a snapshot
of the values of context variables. There are user, system and environment
facets of context. We tried to enumerate the lower level facets. See the
Group Notes for the discussion results.
Interestingly, there was another
IRiX
workshop held in Glasgow in July 2005, which was organized by
Joemon
Jose and Keith van Rijsbergen. I think the next important steps for IRiX
research in academia include preparing an evaluation data set (there is
a test
data set available for contextual
search from the UCAIR project, but it is far from enough), proposing retrieval
models that incorporate context information, and building a real system to
demonstrate the power of search in context.
Jaime Teevan's Personalized Search Paper - SIGIR Series (2) [tag: Jaime
Teevan, Susan Dumais, SIGIR 2005]
September 5, 2005
At ACM SIGIR 2005,
Jaime Teevan presented her work of
Personalizing Search via Automated Analysis of
Interests and Activities with
Susan Dumais and
Eric Horvitz. This is a very interesting and very solid piece of work
on personalized search. In their problem setting, they do personalized
search at the client side and use result reranking to do
personalization. Their reranking formula is borrowed from relevance
feedback. However, the feedback documents are implicitly rather than
explicitly provided by the user. They study how to obtain the
statistics in the formula, such as tf and idf. There are several
variations of collection/corpus representation, user representation, and
document and query representation. Two interesting findings are that
the performance of the personalized results purely based on user
profiles is actually worse than that of the original web search results and
that a mixture of the personalized ranking and the original web ranking is
needed to achieve better results.
This work can be considered as
using long-term context to improve retrieval accuracy. It is not
clear how the work selects the recent index. In my opinion, the context
information in the same information seeking session is the
short-term context, which may be more important than long-term context
for improving the search results. One problem of short-term context is
that it is generally very sparse. However, the long-term context can
be used for smoothing. I think that one interesting direction for
future work is to explore how to differentiate the user profiles stored
on the local disk instead of treating them equally, and to use them at a finer granularity.
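To illustrate what I mean by smoothing, here is a minimal sketch that interpolates a sparse short-term (session) term distribution with a long-term (profile) one. The linear interpolation and the value of `lam` are my assumptions for illustration; they are not taken from the paper above.

```python
from collections import Counter


def unigram_model(terms):
    """Maximum likelihood unigram distribution over a list of terms."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def smooth(short_term, long_term, lam=0.7):
    """Linear interpolation: lam * short-term + (1 - lam) * long-term."""
    vocabulary = set(short_term) | set(long_term)
    return {
        term: lam * short_term.get(term, 0.0) + (1.0 - lam) * long_term.get(term, 0.0)
        for term in vocabulary
    }


session = unigram_model(["java", "island"])
profile = unigram_model(["java", "programming", "java", "compiler"])
context = smooth(session, profile)
```

The session model alone gives zero probability to any term the user has not typed or clicked in this session. After smoothing, terms from the long-term profile get a small non-zero probability, while the session terms still dominate.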
Google Personalized Search - Industry Series (1) [tag: Google
Personalized Search, industry]
September 4, 2005
Google Personalized Search was launched
on June 28, 2005, which was reported in
Search Engine Watch Blog. Before that,
there was another
Google Personalized Search, which asked
each user to create a profile explicitly (e.g., which categories of information, such as
kids and sports, you prefer), and then personalized the search
results according to the categories you selected. The new Google
Personalized Search instead uses the user's interaction history as implicit feedback
information to infer the user's interests and personalize the search
results. They store the user's query history and clickthrough data on
Google servers. When the user submits a query, the query history and
clickthrough data are used to personalize the search results. No details
about how they utilize the user interaction history in personalization
are provided. Presumably, the more user interaction history data are
collected, the more relevant the personalized search results become.
Charlene Li has a
blog post discussing the launch of Google
Personalized Search. Several issues mentioned in the post, such as privacy and possible
improvements, are actually addressed in the UCAIR
project (see the paragraph after next).
Geeking with Greg has several thoughts
about Google Personalized Search too. One issue pointed out in the blog
is that Google personalized search does not provide the user the
information about which results are pushed up by Personalized Search
(UCAIR project does tell the user which results are pushed up by
personalization) and why some results are pushed up by personalization.
In contrast to Google Personalized Search,
the UCAIR project personalizes search results on the client side. The
user interaction history is stored entirely on the client side, so
privacy is not the issue that it is for Google Personalized Search.
Moreover, by putting personalized search on the client side, we can utilize
more user information on the client side to do further personalization.
For example, we can make use of the local files on the user's hard disk
and the web browser's bookmarks as long-term user interests to personalize
the search results. It is expected that some search engine toolbars will
introduce such functionality in the future.
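As a rough illustration of client-side personalization, here is a minimal Python sketch. It is not the UCAIR implementation: the function names, the word-count profile, and the overlap score are all assumptions made for this example. It builds an interest profile from locally stored history, re-ranks the results returned by the search engine, and marks which results were pushed up so that the user can be told.

```python
from collections import Counter
from typing import Dict, List


def build_profile(history: List[str]) -> Counter:
    """Count words in locally stored history (past queries, clicked titles)."""
    profile: Counter = Counter()
    for text in history:
        profile.update(text.lower().split())
    return profile


def rerank(results: List[Dict[str, str]], profile: Counter) -> List[Dict[str, object]]:
    """Re-rank results by overlap with the profile (hypothetical scoring)."""
    scored = []
    for rank, result in enumerate(results):
        terms = set(result["snippet"].lower().split())
        score = sum(profile[t] for t in terms)  # unseen terms count as 0
        scored.append((score, rank, result))
    # Higher score first; ties keep the search engine's original order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    reranked = []
    for new_rank, (_, old_rank, result) in enumerate(scored):
        reranked.append({"title": result["title"], "pushed_up": new_rank < old_rank})
    return reranked


history = ["java programming tutorial", "java compiler error"]
results = [
    {"title": "Java island travel", "snippet": "visit the island of java in indonesia"},
    {"title": "Java language", "snippet": "java programming language tutorial"},
]
personalized = rerank(results, build_profile(history))
```

Because both the history and the re-ranking stay on the user's machine in this sketch, nothing beyond the query itself needs to be sent to the search engine.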
Thorsten Joachims' Implicit Feedback Paper - SIGIR Series (1)
[tag: Thorsten Joachims, clickthrough data]
September 3, 2005
At SIGIR 2005,
Professor Thorsten Joachims and his coworkers presented a very
good paper,
Accurately Interpreting Clickthrough Data as
Implicit Feedback. In this work, they study how reliable user
clickthrough data is as a source of implicit feedback. They
conducted two related user studies. One user study uses an eye tracker to
record user eye movements to infer the user's browsing behavior (viewing,
clicking and the relationship between them). The other user study asks
the participants to explicitly judge relevance of search results so that
the correlation between implicit feedback and explicit relevance
feedback can be studied. The main finding of this work is that
relative relevance judgments derived from clicks (e.g., one search result
is more relevant than another search result) are more accurate than
absolute relevance judgments.
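To make the idea of relative judgments concrete, here is a small Python sketch of the "Click > Skip Above" strategy discussed in the paper: a clicked result is taken to be preferred over every unclicked result ranked above it. The function name and data layout are my own choices for illustration.

```python
from typing import List, Set, Tuple


def click_skip_above(ranking: List[str], clicked: Set[str]) -> List[Tuple[str, str]]:
    """Return (preferred, less_preferred) pairs inferred from clicks.

    A clicked document is assumed to be more relevant than each
    unclicked document that was ranked above it.
    """
    pairs = []
    for position, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        for above in ranking[:position]:
            if above not in clicked:
                pairs.append((doc, above))
    return pairs


ranking = ["d1", "d2", "d3", "d4"]
preferences = click_skip_above(ranking, {"d3"})
```

Note that a click on the top result yields no pairs at all, which is why this kind of feedback says nothing about absolute relevance.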
This paper mainly reports user
studies that give some insight into clickthrough data
as implicit feedback. They have a related paper,
Optimizing Search Engines Using Clickthrough Data,
in SIGKDD 2002, which uses an SVM to train a retrieval ranking
function from clickthrough data. In the SIGKDD 2002 paper, they
already used relative relevance instead of absolute relevance.
A follow-up to the SIGIR
2005 paper is
Query Chains: Learning to Rank from Implicit
Feedback in SIGKDD 2005. They used clickthrough data to
train the retrieval function of a search engine. The SIGKDD 2005 paper
can be considered a combination of the SIGKDD 2002 and SIGIR 2005
papers.
This work can also be considered
a finer-granularity study of how to make use of clickthrough data in
contextual search. Our paper Context-Sensitive
Information Retrieval Using Implicit Feedback in SIGIR 2005
focuses on how to incorporate clickthrough data into contextual search.
SIGIR
2005 Papers from UCAIR Project - UCAIR Series (2) [tag: UCAIR, active feedback, implicit
feedback]
September 2, 2005
There are two papers from the UCAIR project
published in
SIGIR
2005. The title of one paper is
Active Feedback in Ad-hoc Information Retrieval
by Xuehua Shen and
ChengXiang Zhai. This work studies how to select documents
for user relevance judgment when the user is willing to judge the
relevance of some documents, whereas traditional relevance feedback
focuses on query term expansion and query term reweighting
given user-judged feedback documents. We propose a preliminary
framework and several active feedback methods in this paper. The title
of the other paper is
Context-Sensitive Information Retrieval Using
Implicit Feedback by Xuehua Shen, Bin Tan and ChengXiang
Zhai. This work studies how to model user interaction history (implicit
feedback) to improve retrieval accuracy. We propose four statistical
contextual language models to incorporate context information.
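As a rough illustration of what incorporating context into a language model can look like, the Python sketch below linearly interpolates a unigram model of the current query with a unigram model of the interaction history. It is a generic example and is not presented as any of the four models in the paper; the function names and the weight `alpha` are assumptions.

```python
from collections import Counter
from typing import Dict, List


def unigram_model(texts: List[str]) -> Dict[str, float]:
    """Maximum-likelihood unigram model over whitespace-separated words."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def interpolate(query: str, history: List[str], alpha: float = 0.5) -> Dict[str, float]:
    """Mix the query model and the history model with a fixed weight alpha."""
    q = unigram_model([query])
    h = unigram_model(history)
    if not h:  # no history yet: fall back to the query alone
        return q
    if not q:  # empty query: fall back to the history alone
        return h
    vocab = set(q) | set(h)
    return {w: alpha * q.get(w, 0.0) + (1 - alpha) * h.get(w, 0.0) for w in vocab}


model = interpolate("java", ["java programming", "programming language"], alpha=0.5)
```

With this history, the ambiguous query "java" ends up with probability mass on "programming" and "language", which is the kind of disambiguation that context is meant to provide.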
CIKM
2005 Paper from UCAIR Project - UCAIR Series (1) [tag: CIKM 2005, motivation of
personalization]
September 1, 2005
A common major limitation of existing
retrieval models and systems is that the retrieval decision is, in
general, based solely on the query and document collection; information
about the actual user and the search context is largely ignored. This
limitation makes the retrieval performance of existing IR systems
inherently non-optimal, as seen clearly in the following two cases:
- Different users may use exactly the
same query to search for different information, but existing IR
systems return the same results for these users. For example, the
query "IR applications" on Google returns a mixture of documents
about "information retrieval" applications and "infrared"
applications, as "IR" can be an acronym for both information
retrieval and infrared. Without considering the actual user it is
inherently impossible to know which sense "IR" refers to.
- A user's information needs may
change over time. The same user may sometimes use "java" to mean
the Java island and at other times use "java" to mean the
programming language. Without recognizing the search context, it
would again be inherently impossible to recognize the correct sense.
It is therefore clear that an optimal
retrieval system must incorporate both user information and search
context into the retrieval decision process.
The UCAIR (pronounced "you care";
short for User Centered Adaptive Information Retrieval) project seeks to
break this limitation of the existing retrieval methods and formally
develop a new retrieval paradigm called user-centered adaptive
information retrieval (UCAIR), in which user information and search
context are both exploited to improve retrieval performance.
Here is a paper that will be published
in
CIKM
2005.
The paper is
Implicit User Modeling for Personalized Search
by Xuehua Shen, Bin Tan, and ChengXiang Zhai. There are two main
contributions of this paper. One is to propose a decision-theoretic
framework and develop techniques for implicit user modeling in
information retrieval. The other is to develop and evaluate a
client-side personalized search agent
UCAIR.