CS410 Assignment #4: Web Search Engine

Due: April 9th 2008 in class

Introduction

The goal of this assignment is to improve the performance of a search engine. You will be using the light version of the Lemur CGI for this assignment. You can download the code from here.

The Lemur CGI is a CGI executable that runs under an HTTP server (web server) and provides access to indexes and general search capabilities. The CGI files will be installed in /site-search/cgi. Be sure that your web server configuration allows executables to be run.
 

This assignment involves one important type of search: informational search.

  • Informational search: The query describes a general topic, e.g., "information retrieval", and the system should return the web pages that discuss "information retrieval".

The assignment has five stages.

  1. Crawl data: You will use the "wget" command to crawl data. Type "man wget" on a Linux machine to learn how it works. Each of you is given one Computer Science department home page to crawl. More information about it can be found here.
  2. Index your data with Lemur: Index your crawled data
  3. Warming Up: Set up the search engine on the csil-linux machine (or your own machine).
  4. Evaluation: Implement new methods and get your system ready. You should think of a way to improve the performance of the search engine, implement the method and write a report to explain the method. Then you should set up a system with the best performing method (or combination of methods).
  5. Query Contribution: Each of you is required to submit one (informational) query. Please take this chance to think of some "difficult queries" that your search engine handles well.

    Here is the format of the query file. (Your submitted query should follow this format.)

    • ID --- the query id
    • title --- a set of keywords that should be used to search documents with your search engine.
    • desc --- a detailed description of the query or detailed criteria for relevant documents

    The ID of your submitted query should be "yourNETID".

Getting started

First, look here to see which university home page you are assigned to download. Note that you should crawl only the Computer Science department; that is, go to the computer science department page of your assigned university and start crawling from there.

Second, to crawl data, you need to use the "wget" command. More information about the command can be found here, or you can type "man wget" on a Linux machine to learn how the command works for crawling data. You only need to crawl HTML files.

Some of the important options are:

  • "-r" : Turn on recursive retrieving. When using this option, it will create a hierarchy of directories. Recursive retrieval of http and html content is breadth-first. This means that Wget first downloads the requested html document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth. The maximum depth to which the retrieval may descend is specified with the `-l' option. The default maximum depth is five layers.
  • "-nd": Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions `.n').
  • "-R rejlist": Specify a comma-separated list of file name suffixes or patterns to reject. For example, if you want to download a whole page except for the cumbersome mpegs and .au files, you can use "wget -R mpg,mpeg,au".
  • "-A acclist" (long form "--accept acclist"): Specify a comma-separated list of file name suffixes or patterns to accept.
     

You should be warned that recursive downloads can overload the remote servers. Because of that, many administrators frown upon them and may ban access from your site if they detect very fast downloads of big amounts of content. When downloading from Internet servers, consider using the `-w' option to introduce a delay between accesses to the server. The download will take a while longer, but the server administrator will not be alarmed by that.

So, for our purposes, create a directory called, e.g., crawledData, go to that directory, and type the following command:

wget -R extensions_which_are_not_necessary url_of_CSdepartment -r -nd

For example, if you want to crawl our CS department, you can do the following:

wget -R gif,jpg,jpeg http://www.cs.uiuc.edu -r -nd

(Note that you may want to add other unnecessary extensions to the rejection list. We only want to download "html" pages, not pdf, ppt, etc.)

Another way of doing it is to work with an accept list:

wget -A "*.html,*.htm" http://www.cs.uiuc.edu/ -r -nd --html-extension --random-wait
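
The options above can also be assembled programmatically. The following is a minimal Python sketch, not part of the assignment code, that builds a wget command combining the accept list with the "-w" delay recommended earlier. The function names, the two-second delay, and the default depth are illustrative assumptions.

```python
import os
import subprocess


def build_wget_command(start_url, accept=("*.html", "*.htm"),
                       wait_seconds=2, max_depth=5):
    """Return the wget argument list for a polite recursive HTML crawl.

    The parameter names and default values are hypothetical choices
    made for this sketch.
    """
    return [
        "wget",
        "-r",                      # recursive retrieval
        "-l", str(max_depth),      # maximum recursion depth
        "-nd",                     # save all files in one directory
        "-A", ",".join(accept),    # accept list: keep only HTML pages
        "-w", str(wait_seconds),   # delay between requests
        "--random-wait",           # vary the delay that -w sets
        start_url,
    ]


def crawl(start_url, target_dir="crawledData"):
    """Run the crawl inside target_dir (created if it does not exist)."""
    os.makedirs(target_dir, exist_ok=True)
    subprocess.run(build_wget_command(start_url), cwd=target_dir, check=True)
```

Calling crawl("http://www.cs.uiuc.edu/") would start the download in crawledData; printing the list returned by build_wget_command first is a cheap way to check the options before contacting the server.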

Third, build an index for your data with the following steps:

  • Go to the root directory of Lemur and type ./configure and then gmake.
  • You may want to create a folder called crawledData in your home directory, or in another location, containing all the HTML files that you have crawled. Then you need to preprocess these HTML files, because Lemur will index them and they must follow the Lemur format. For each HTML file, you need to add the following lines at the top and the end of the document.

           <DOC>

            <DOCNO> document number (this number should be unique) </DOCNO>

            .......

            </DOC>

         So, you need to add two lines at the top of each html file and one line at the end of the html file.

        You should write a script to do the above.

  • If you have created a folder called crawledData in your home directory containing all the HTML files that you have crawled, you can use BuildDocMgr in /app/obj/BuildDocMgr. The parameter file is called cgi_param and is located in /data/cgi_param. You need to edit the following fields in this file:

            <index> is an absolute path to the location where you want to build your index. The last component of the path should be the name of your index, e.g., pindex.

            For example, if I want to place my index in index directory in home, the absolute path would be: /home/index/pindex

            <manager> is the same path as above

            <dataFiles> is the absolute path to your data file. That is, you may want to create a "df" file containing the absolute paths to your HTML files, one line per HTML file that you have placed in crawledData.

            For example, if you have three files in the crawledData directory named file1, file2, and file3, your "df" file should have three lines like:

                              /home/crawledData/file1
                              /home/crawledData/file2
                              /home/crawledData/file3

        For more information about formatting, read this article.         

       Then, to index your data, go to the /app/obj directory and type:

        ./BuildDocMgr absolute-path-to-your-cgi_param
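
The preprocessing and "df" steps above could be scripted along these lines. This is a sketch under assumptions: the function names, the "doc-000001" numbering scheme, and the choice to write wrapped copies into a separate output directory (so the crawled originals stay untouched) are all hypothetical, and the result should be checked against the Lemur documentation before indexing.

```python
import os


def wrap_for_lemur(html_bytes, docno):
    """Add the <DOC>/<DOCNO> lines at the top and </DOC> at the end."""
    header = "<DOC>\n<DOCNO> {} </DOCNO>\n".format(docno).encode("ascii")
    return header + html_bytes.rstrip(b"\n") + b"\n</DOC>\n"


def preprocess_directory(src_dir, out_dir, df_path, prefix="doc"):
    """Wrap every file in src_dir and list the wrapped files in df_path.

    Files are handled as raw bytes so that pages in any character
    encoding pass through unchanged. Returns the absolute paths that
    were written to the "df" file, one per wrapped document.
    """
    src_dir = os.path.abspath(src_dir)
    out_dir = os.path.abspath(out_dir)
    os.makedirs(out_dir, exist_ok=True)

    names = sorted(
        name for name in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, name))
    )

    wrapped_paths = []
    for number, name in enumerate(names, start=1):
        with open(os.path.join(src_dir, name), "rb") as f:
            html_bytes = f.read()
        # A running counter keeps every document number unique.
        docno = "{}-{:06d}".format(prefix, number)
        out_path = os.path.join(out_dir, name)
        with open(out_path, "wb") as f:
            f.write(wrap_for_lemur(html_bytes, docno))
        wrapped_paths.append(out_path)

    # The "df" file holds one absolute path per line.
    with open(df_path, "w") as f:
        for path in wrapped_paths:
            f.write(path + "\n")
    return wrapped_paths
```

With this layout, the <dataFiles> field of cgi_param would point at the "df" file passed as df_path.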

Finally, you need to set up your search engine with the following steps:

  • Go to the /site-search/cgi directory. The most important .cpp file you will be using is DBInterface.cpp. There is a "search" method in that file which calls the "retrieve" method. Two other methods that should be familiar to you are computeWeight and computeAdjustedScore. These two methods are empty. To improve the performance of your Web search engine, you may want to change these methods or the "search" method. Whenever you change this .cpp file, just type gmake.
  • Whenever you type "gmake", you will see "lemur.cgi", the file that is used for web search, in the "bin" directory. The "bin" directory also contains other important files, such as "lemur.config", which contains the index path once you have indexed your data.

    The configuration file (lemur.config) is a well-formed XML file with the opening tag <lemurconfig>. There are two required elements within the configuration file:

    <templatepath>: this should reflect the path (either relative or absolute) to the template files. The template files are in the bin directory.

    <indexes>: this section contains information about what indexes are available, and can contain as many indexes as needed. Each <index> item can hold two elements. First, a <path> element must point to where the index is located. Second (and optionally), a <description> element can describe that index. The path should be the full path to the index location. (Note that we only need to use one index.) You should give the absolute path to your KEY index, e.g., "pindex.key", and everybody should be able to open your index and execute your CGI script. Change the mode of your index file by using:

        chmod  755 name_of_your_index-file  

        Mode 755 lets everybody read and execute the file, while only the owner can write to it.

    lemur.cgi and lemur.config should also be in mode 755 so that everyone can read and execute them.

    If you wish to use the default HTML templates, no modifications are necessary, but if you want to modify the HTML templates for your own uses, be sure to read the "README_Templates.txt" file for instructions on available commands that you can use within the templates.

     

  • Copy all the contents of the cgi directory to the csil-projects/cgi-bin folder in your home directory on your CSIL machine. Read this document (web space) to see where you should place your cgi and html files.
  • Once you have placed the required files in the cgi-bin folder on your CSIL machine, you can see your web page through a web browser by using the following address:

            http://csil-projects.cs.uiuc.edu/~netid/cgi-bin/file_in_cgi-bin
 

        where netid is your NetID and file_in_cgi-bin is your cgi file, e.g., lemur.cgi

  • If you have done the previous steps correctly, you will see a page titled "Lemur Search" with a search box. You need to fill in the empty methods mentioned above before you can see any results.
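
The lemur.config layout described above could be generated as follows. This sketch assumes the element nesting implied by the description (a <path> and an optional <description> inside each <index>, all inside <indexes>); the function names are hypothetical, so compare the output with the lemur.config shipped in the bin directory before relying on it.

```python
import os
import xml.etree.ElementTree as ET


def build_lemur_config(template_path, indexes):
    """Return lemur.config content as an XML string.

    indexes is a list of (index_path, description) pairs; pass None as
    the description to leave out the optional <description> element.
    """
    root = ET.Element("lemurconfig")
    ET.SubElement(root, "templatepath").text = template_path
    indexes_el = ET.SubElement(root, "indexes")
    for index_path, description in indexes:
        index_el = ET.SubElement(indexes_el, "index")
        ET.SubElement(index_el, "path").text = index_path
        if description:
            ET.SubElement(index_el, "description").text = description
    return ET.tostring(root, encoding="unicode")


def write_lemur_config(config_path, template_path, indexes):
    """Write the configuration file and set it to mode 755."""
    with open(config_path, "w", encoding="utf-8") as f:
        f.write(build_lemur_config(template_path, indexes) + "\n")
    # Same effect as "chmod 755 config_path".
    os.chmod(config_path, 0o755)
```

For example, write_lemur_config("lemur.config", ".", [("/home/index/pindex.key", "CS department pages")]) would describe a single index; the template path "." and the description text are placeholders.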

       

Tasks

  1. [30 points]  Crawl data
  2. [15 points]  Index your data
  3. [45 points] Set up the search engine according to the above instructions. Get a sense of how it works and how you could improve its performance or functionality. Think of a method to improve the performance of the current search engine, implement your method, and write a brief report to explain your method and results. The improvement can be along the following directions:
    • New retrieval methods to improve the retrieval performance, such as PageRank etc.
    • Different search interface
    • New functionalities, such as spelling correction or duplicate page detection
    • Any other new ideas

    We will grade your answers based on the novelty of the methods and the performance.

  4. [10 points] Provide one informational query.
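
As an illustration of the first direction listed under task 3, PageRank scores for the crawled pages could be computed with a simple power iteration. The sketch assumes you have already extracted a link graph from your crawled HTML as a dictionary mapping each page to the pages it links to; the function name, the damping factor of 0.85, and the fixed iteration count are illustrative choices, and how the scores are combined with Lemur's retrieval scores is left to your design.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links maps each page to a list of the pages it links to. Pages that
    appear only as link targets are included automatically. Returns a
    dictionary of scores that sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly
                # over all pages so that no rank is lost.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For a small graph such as {"a": ["b"], "b": ["a"], "c": ["a"]}, page "a" ends up with the highest score because it is linked from both other pages, and "c" with the lowest because nothing links to it.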

What to turn in

A. Implementation:

  • Turn in a hard copy of your report (at most one page) in class. The report should include the method you used and your observations.
  • Give the absolute path of your crawled data.
  • Pack all your **modified and new** code into one single zip or tar file, name it "assign4-YourNetID", and send it to "maryamcs410 AT gmail DOT com".
  • Set up a search engine and turn in its URL along with a short description by sending email to "maryamcs410 AT gmail DOT com".

B. Query Contribution: Turn in your query. Send email to "maryamcs410 AT gmail DOT com".