CS410
Assignment #4: Web Search Engine
Due: April 9th, 2008, in class
Introduction
The goal of this assignment is to improve the performance of a search engine. You
will be using the light version of the Lemur CGI for this
assignment. You can download the code from here.
The Lemur CGI is a CGI executable that runs under an HTTP server (web
server) and provides access to indices and general search capabilities. The CGI
files will be installed in /site-search/cgi. Be sure that your web server
configuration allows executables to be run.
This assignment involves an important type of search: informational search.
- Informational search:
The query describes a general topic, e.g. "information retrieval", and the
system should return the web pages that discuss "information retrieval".
The assignment has five stages.
- Crawl data: You will use the "wget" command to crawl data. Type "man wget" on a
Linux machine to learn how it works. Each of you is given one Computer
Science department home page to crawl. More information about it can
be found here.
- Index your data with Lemur: Index your crawled data.
- Warming Up: Set up the search engine on the csil-linux machine (or your own machine).
- Evaluation: Implement
new methods and get your system ready. You should
think of a way to improve the performance of the search engine, implement
the method, and write a report explaining it. Then you
should set up a system with the best-performing method (or combination of
methods).
- Query Contribution:
Each of you is required to submit one (informational) query. Please take this chance to think of some "difficult queries" that your
search engine can handle well.
Here is the format of the query file. (Your submitted query should follow
this format.)
- ID --- the query id
- title --- a set of keywords that should be used to search documents with your search engine
- desc --- a detailed description of the query or detailed criteria for relevant documents
The ID of your submitted query should be "yourNETID".
Getting started
First, look here to see which university
home page you are assigned
to download. Note that you should crawl only the Computer Science department; that
is, go to the computer science department site of the university you are assigned
and start crawling from there.
Second, to crawl data, you need to use the "wget" command. More information
about the command can be found here, or you can type "man wget" on a Linux machine to learn how
the command works for crawling data. You only need to crawl html files.
Some of the important options are:
- "-r" : Turn on recursive retrieving. When
using this option, it will create a hierarchy of directories. Recursive
retrieval of http and html
content is breadth-first. This means that Wget first downloads
the requested html document, then the documents
linked from that document, then the documents linked by them, and so on. In
other words, Wget first downloads the documents at depth 1, then those at
depth 2, and so on until the specified maximum depth. The maximum depth
to which the retrieval may descend is specified with the `-l'
option. The default maximum depth is five layers.
- "-nd":
Do not create a hierarchy of directories when
retrieving recursively. With this option turned on, all files will get saved
to the current directory, without clobbering (if a name shows up more than
once, the filenames will get extensions `.n').
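- "-R rejlist" (or "--reject rejlist"): Specify a comma-separated list of file name suffixes or patterns to reject. For example, if you want to
download a whole page except for the cumbersome mpegs
and .au files, you can use "wget -R mpg,mpeg,au".
- "-A acclist" (or "--accept acclist"): Specify a comma-separated list of file name suffixes or patterns to accept.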
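The breadth-first order described for "-r" and the depth limit set by "-l" can be illustrated with a small Python sketch. This is not how Wget is implemented; the link graph, the page names, and the function crawl_order are hypothetical, and the starting page is treated as depth 0.

```python
from collections import deque

def crawl_order(links, start, max_depth=5):
    """Return pages in breadth-first order, at most max_depth levels below start."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in seen:  # visit each page only once
                seen.add(target)
                queue.append((target, depth + 1))
    return order

# Hypothetical link graph: page -> pages it links to.
links = {
    "index.html": ["people.html", "research.html"],
    "people.html": ["faculty.html"],
    "research.html": ["index.html", "ir.html"],
    "faculty.html": ["deep.html"],
}
print(crawl_order(links, "index.html", max_depth=2))
```

With max_depth=2, the page "deep.html" is never reached because it sits three links away from the start page.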
You should be warned that recursive downloads can
overload the remote servers. Because of that, many administrators frown upon
them and may ban access from your site if they detect very fast downloads of big
amounts of content. When downloading from Internet servers, consider using the `-w'
option to introduce a delay between accesses to the server. The download will
take a while longer, but the server administrator will not be alarmed by that.
So, for our purposes, create a directory called, e.g., crawledData, go to that
directory, and type the following command:
wget -R extensions_which_are_not_necessary url_of_CSdepartment -r -nd
For example, if you want to crawl our CS department, you can do the following:
wget -R gif,jpg,jpeg http://www.cs.uiuc.edu -r -nd
(Note that you may want to list other unnecessary extensions in the rejection list.
Be advised that we only want to download "html" pages,
not pdf, ppt, etc.)
Another way of doing it is to work with an accept list:
wget -A "*.html,*.htm" http://www.cs.uiuc.edu/ -r -nd --html-extension --random-wait
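If you prefer to drive the crawl from a script, the command line can be assembled programmatically. The following is a minimal Python sketch that uses only wget options named in this handout ("-r", "-nd", "-l", "-w", "--random-wait", "-A"); the function name build_wget_command and its default values are hypothetical choices, not part of the assignment.

```python
import shlex

def build_wget_command(url, accept=("*.html", "*.htm"), wait_seconds=1, depth=5):
    """Assemble a wget argument list for a polite, html-only recursive crawl."""
    return [
        "wget",
        "-r",                     # recursive retrieval
        "-nd",                    # save everything in the current directory
        "-l", str(depth),         # maximum recursion depth
        "-w", str(wait_seconds),  # delay between accesses to the server
        "--random-wait",
        "-A", ",".join(accept),   # accept list
        url,
    ]

cmd = build_wget_command("http://www.cs.uiuc.edu/", wait_seconds=2)
# Print the command as it would be typed in a shell.
print(shlex.join(cmd))
```

To actually start the crawl, the list could be passed to subprocess.run(cmd, check=True, cwd="crawledData"), assuming the crawledData directory already exists.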
Third, build an index for your data by following these steps:
- Go to the root directory of Lemur and type ./configure and then gmake.
- You may want to create a folder called
crawledData in your home directory or another location to hold all the HTML files that you have
crawled. Because we will use Lemur to index these documents, you need to preprocess the html files
so that they follow the Lemur format. For each html file, you need to add the following lines at the
top and the end of the document.
<DOC>
<DOCNO> document number (this number should be unique) </DOCNO>
.......
</DOC>
So, you need to add two
lines at the top of each html file and one line at the end of the html file.
You should write a script to do
the above.
- If you have created a folder called
crawledData in your home directory containing all the HTML files that you have
crawled, you can use BuildDocMgr in /app/obj/BuildDocMgr. The parameter file
is called cgi_param and is located in /data/cgi_param. You need to edit the
following fields in this file:
<index> is an absolute path to the location where you want
to build your index. The last component of the path
is the name of your index, e.g., pindex.
For
example, if I want to place my index in the index directory under home,
the absolute path would be: /home/index/pindex
<manager>
is the same path as above.
<dataFiles> is the absolute
path to your data file. That is, you may want to create a "df" file containing
the absolute paths to your html files, one line per html file that you have
placed in crawledData.
For
example, if you have three files in the crawledData directory named file1,
file2, and file3, your "df" file should have three lines like:
/home/crawledData/file1
/home/crawledData/file2
/home/crawledData/file3
For more information about
formatting, read this article.
Then, to index your data, go to the /app/obj directory and type:
./BuildDocMgr absolute-path-to-your-cgi_param
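The preprocessing step and the "df" file described in the steps above can be sketched as a short Python script. This is a minimal sketch under stated assumptions: the function names wrap_for_lemur and prepare_directory are hypothetical, documents are simply numbered 1, 2, 3, ... in sorted file-name order (any unique identifier would do), and the "df" file is assumed to be written outside crawledData so that it is not treated as a document.

```python
from pathlib import Path

def wrap_for_lemur(html_bytes, docno):
    """Add the two header lines and the closing </DOC> line around one document."""
    header = f"<DOC>\n<DOCNO> {docno} </DOCNO>\n".encode("ascii")
    return header + html_bytes.rstrip() + b"\n</DOC>\n"

def prepare_directory(data_dir, df_path):
    """Wrap every file in data_dir in place and write the list of absolute paths."""
    files = sorted(p for p in Path(data_dir).iterdir() if p.is_file())
    paths = []
    for number, path in enumerate(files, start=1):
        content = path.read_bytes()  # bytes, so the page encoding is left untouched
        if not content.startswith(b"<DOC>"):  # skip files that are already wrapped
            path.write_bytes(wrap_for_lemur(content, number))
        paths.append(str(path.resolve()))
    # One absolute path per line, as in the "df" example above.
    Path(df_path).write_text("\n".join(paths) + "\n", encoding="utf-8")
    return paths
```

For example, prepare_directory("/home/crawledData", "/home/df") would wrap the crawled files in place and write the path list; /home/df is a hypothetical location, and it is the path that <dataFiles> in cgi_param would then point to.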
Finally, set up your search engine by following these steps:
- Go to the /site-search/cgi directory. The most important .cpp file you will
be using is DBInterface.cpp. There is a "search" method in that file, which
calls the "retrieve" method.
The two other methods that are familiar to you are computeWeight and
computeAdjustedScore. These two methods are empty. To improve the
performance of your Web search engine, you may want to change these methods or
the "search" method. Whenever you change this .cpp file, just type gmake.
- Whenever you type "gmake", the "bin" directory will contain "lemur.cgi",
which is the file that is used for web search. There are also other important files in the "bin" directory,
such as "lemur.config", which contains the index path once you have indexed your
data.
The configuration file (lemur.config) is a well-formed XML file with the
opening tag <lemurconfig>. There are two required elements within the
configuration file:
<templatepath>: this should reflect the path (either relative or
absolute) to the template files. The template files are in the bin directory.
<indexes>: this section contains information about what indexes are
available, and can contain as many indexes as needed. For each <index> item,
there should be two elements. First, a <path> element must be set pointing
to where the index is located. Second (and optionally), a <description>
tag can be set to describe the index. The path should be
the full path to the index location. (Note that we only need to use one
index.) You should give the absolute path to your KEY index file, e.g.,
"pindex.key", and everybody should be able to open your
index and execute your cgi script. Change the mode of your index file by using:
chmod 755 name_of_your_index-file
Mode 755 gives the owner read, write, and execute permission and gives everybody else read and execute permission.
lemur.cgi and lemur.config should also be in 755 mode so that everyone can read and
execute them.
If you wish to use the default HTML templates, no modifications are
necessary, but if you want to modify the HTML templates for your own uses,
be sure to read the "README_Templates.txt" file for instructions on
available commands that you can use within the templates.
- Copy all the contents of the cgi directory to the csil-projects/cgi-bin folder in
your home directory on your CSIL machine.
Read this document (web space) to
see where you should place your cgi and html files.
- Once you have placed the required files in the cgi-bin
folder on your CSIL machine, you can see your web page through a web browser by using the
following address:
http://csil-projects.cs.uiuc.edu/~netid/cgi-bin/file_in_cgi-bin
where netid is your netid and
file_in_cgi-bin is your cgi file, e.g., lemur.cgi
- If you have done the previous steps correctly, you
will see a page with the
"Lemur Search" title and a search box.
You need to fill in the methods mentioned above first in order to see
any results.
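As an illustration of the lemur.config layout described in the steps above, the following Python sketch generates a configuration with one index. The element names come from this handout, but the exact nesting (<path> and <description> inside <index>, inside <indexes>) is an assumption based on that description. The function name build_lemur_config, the template path "./", and the description text are hypothetical, and the index path combines the earlier /home/index/pindex example with the "pindex.key" file name.

```python
import xml.etree.ElementTree as ET

def build_lemur_config(template_path, index_path, description=None):
    """Return the text of a lemur.config file that lists a single index."""
    root = ET.Element("lemurconfig")
    ET.SubElement(root, "templatepath").text = template_path
    indexes = ET.SubElement(root, "indexes")
    index = ET.SubElement(indexes, "index")
    ET.SubElement(index, "path").text = index_path
    if description is not None:  # the description element is optional
        ET.SubElement(index, "description").text = description
    ET.indent(root)  # pretty-print; available in Python 3.9 and later
    return ET.tostring(root, encoding="unicode")

config_xml = build_lemur_config("./", "/home/index/pindex.key", "CS department pages")
print(config_xml)
```

The printed text could be saved as lemur.config in the "bin" directory and compared against the file shipped with the code before relying on it.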
Tasks
- [30 points] Crawl data
- [15 points] Index your data
- [45 points] Set up the
search engine according to the above instructions. Get a sense of how it works
and how you could improve its performance or
functionality. Think of a method to improve the performance of the current
search engine, implement your method, and write a brief report to explain
your method and results. The improvement can be along the following
directions:
- New retrieval methods to improve the retrieval performance, such as PageRank
- Different search interface
- New functionalities, such as spelling correction or duplicate page detection
- Any other new ideas
We will grade your answers based on the novelty of the
methods and the performance.
- [10 points] Provide one informational query.
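As an illustration of the PageRank direction mentioned in the task list above, here is a minimal power-iteration sketch in Python. It is only one possible starting point, not a prescribed solution. The link graph, the function name pagerank, the damping factor of 0.85, and the iteration count are assumptions, and you would still have to extract the links from your crawled pages and decide how to combine the resulting scores with the retrieval scores in DBInterface.cpp.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a graph given as page -> list of linked pages."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical three-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this small graph, page "c" receives links from both other pages and ends up with the highest score, while "b", which is linked only from "a", ends up with the lowest.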
What to turn in
A. Implementation:
- Turn in a hardcopy of your
report (at most one page) in class. The report should include the
method you used and your observations.
- Give the absolute path of
your crawled data.
- Pack all your **modified and
new** code into a single zip or tar file, name it
"assign4-YourNetID", and send it to "maryamcs410 AT gmail DOT com".
- You should set up a
search engine and turn in its URL along with a short description by sending email to
"maryamcs410 AT gmail DOT com".
B. Query Contribution: Turn in your query. Send email to "maryamcs410 AT gmail
DOT com".