CS410 Assignment #2: Set up a Forum Search Engine
(due Sunday, Feb 17, 2013, 11:59pm, extended)

The goal of this assignment is to set up a forum search engine. It is an individual assignment. You will be asked to crawl an Internet forum of interest to you, and then set up a search engine to support searching the crawled forum. You will be given an experimental search engine toolkit (Apache Lucene plus Solr), though you’ll need to configure it and process the data to actually make the search engine work appropriately. Finally, you’ll be asked to come up with a sample query with relevant documents in your crawled forum data, and make relevance judgments on a small number of top-ranked search results. Your queries and relevance judgments will be pooled together to be used in the next assignment, which is a group assignment of search engine competition.

In this homework, you need to use the CS410 server provided by the department. Before you start your homework, make sure that you can access the cs410-server.cs.illinois.edu server through ssh with your NetID and password:

ssh YourNetID@cs410-server.cs.illinois.edu
Enter the password you retrieved to log in. Please make sure you have access before going on with your homework.

You may also need to remote copy files to the server. To do so, you can run the following command on your local machine (where your files reside):

scp YourLocalFile YourNetID@cs410-server.cs.illinois.edu:/home/cs/YourNetID/
or
scp -r YourLocalFolder YourNetID@cs410-server.cs.illinois.edu:/home/cs/YourNetID/
if you have a folder of files to upload. The file/folder will be put under your home folder /home/cs/YourNetID. You can also specify a different destination, e.g. /home/cs/YourNetID/Temp/, as long as the folder exists on the server and you have permission to write to it.

Task 1: Crawling [30 points]

  1. Find Data Sources

    You have two TAs in this course, and they are big fans of sports and recreation, respectively. They use forums a lot, and they are looking forward to having their own search engine for sports and recreation forums.

    Big Boards has a great collection of different forums. You can find your choice of forum in the "Recreation" and "Sports" categories, or you can use any other forum as long as it is relevant to "recreation" or "sports". There are a lot of subcategories for you to explore, such as "basketball" and "baseball" in the "Sports" category. When selecting your forum, bear in mind that it should contain ample information for at least one question you find on the question answering website Yahoo! Answers. (You can choose the question freely --- there are a lot of people who, like your TAs, are fans of sports and recreation.) The question you choose will be used in Task 3, so you may want to read the description of that task before finalizing which forum you will crawl. To prevent duplicate work, please collaborate on maintaining the Forum List Signup page according to the instructions provided. Your choice of forum and question should NOT be the same as anyone else's listed on the page, so be the first one to claim YOUR forum.

  2. Crawl the Forum

    There are two options for crawling the data. It is fairly simple to use Wget, but if you are interested, you are also encouraged to build your own crawler. Notice: since we will later combine all the data you crawl, each of you does not need to crawl very much data. The data crawled from one forum should NOT exceed 30MB, which means in most cases you only need to crawl part of the forum. You can control the amount of data that you crawl with an appropriate Wget option (see the descriptions of the Wget options below).

    Option 1: Using Wget

    Wget is a general purpose crawler installed on almost all Linux distributions. Type "man wget" on a Linux machine to see how it works, or view the online documentation. Some of the important options used in the example below are: -r (recursive download), -nd (do not create a directory hierarchy), -R/-A (reject/accept files by extension), --html-extension (save pages with an .html suffix), --random-wait (random delay between requests), -Q (download quota, e.g. -Q30m for 30MB), and -nv together with -o (write concise progress information to a log file).

    EXAMPLE:

    
    mkdir html
    cd html
    wget -nv -R gif,jpg,jpeg http://forum.doom9.org/ -r -nd --html-extension --random-wait -Q30m -o wget_log
    cd ..
    

    CAUTION:

    1. You should be warned that recursive downloads can overload remote servers. Because of that, many administrators frown upon them and may ban access from your site if they detect very fast downloads of large amounts of content. When downloading from Internet servers, consider using the "-w" or "--random-wait" option, as in the example above, to introduce a delay between accesses to the server. The download will take a little longer, but the server administrator will not be alarmed. The crawled data will be used to set up the search engine; however, the URLs are still needed to link results back to the original web pages, which is why you need a log file to store that information.
    2. Instead of using a reject list, another option is to work with an accept list, e.g. -A "html,htm" (a comma-separated list of suffixes, with no spaces). You can also add suffixes like "php" or "jsp" to your accept list if you believe the contents of those pages are useful.
    3. The '-nv' and '-o logfile' options should be used together with 'wget' in order to produce a correctly formatted log file, which will be used for generating the XML file later.

    Option 2: Building your own crawler.

    Using Wget to crawl forum data can cause several problems: it may download many duplicate pages, as well as useless pages and content such as ads; and some long threads may span several pages, but Wget cannot recognize that they are connected. All of these problems can potentially hurt search performance on forum data. One solution is to build a specialized crawler for each forum site, so that we crawl only the information we are interested in.

    Here are a few steps that you could follow to implement such a specialized crawler: 1) extract the links to the different discussion boards on the forum; 2) extract the links to the different topics/threads on each discussion board (you may need to follow the 'next' links to get all topics); 3) for each thread, extract all the posts (again, you may need to follow the 'next' links to get all posts). Depth-first search and breadth-first search are both feasible.

    Tools: Python or Ruby may be more suitable for implementing the crawler than other programming languages. The BeautifulSoup library for Python and the Scrubyt library for Ruby make the pattern matching fairly simple. However, feel free to use other languages or tools in your implementation. A rough sketch of such a crawler is shown below.
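
    If you go this route, here is a minimal sketch of such a crawler in Python, using requests and BeautifulSoup. The forum URL, the CSS selectors (board links, thread links, 'next' links), and the page cap are hypothetical placeholders invented for this sketch; every forum lays out its HTML differently, so inspect your forum's pages and adapt the selectors, keep a polite delay between requests, and remember the 30MB limit.

    EXAMPLE SKETCH (Python):

    # crawler_sketch.py -- minimal, hypothetical forum crawler; adapt the selectors to your forum
    import os
    import time
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "http://forum.example.com/"  # hypothetical front page; replace with your forum
    OUT_DIR = "html"                        # crawled thread pages are saved here
    MAX_PAGES = 500                         # rough cap to keep the crawl well under 30MB

    def get_soup(url):
        """Politely fetch a page and parse it."""
        time.sleep(1.0)                     # delay between requests, like wget --random-wait
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return BeautifulSoup(resp.text, "html.parser")

    def crawl():
        os.makedirs(OUT_DIR, exist_ok=True)
        saved = 0
        front = get_soup(BASE_URL)
        # Step 1: links to discussion boards ("a.boardlink" is a placeholder selector).
        boards = [urljoin(BASE_URL, a["href"]) for a in front.select("a.boardlink")]
        for board_url in boards:
            page_url = board_url
            # Step 2: walk each board's thread listing, following its 'next' links.
            while page_url and saved < MAX_PAGES:
                listing = get_soup(page_url)
                for a in listing.select("a.threadlink"):   # placeholder selector for threads
                    thread_url = urljoin(page_url, a["href"])
                    # Step 3: save the thread page; the posts live here. (Following 'next'
                    # links inside long threads is omitted to keep the sketch short.)
                    thread = get_soup(thread_url)
                    path = os.path.join(OUT_DIR, "page_%05d.html" % saved)
                    with open(path, "w", encoding="utf-8") as f:
                        f.write(str(thread))
                    saved += 1
                    if saved >= MAX_PAGES:
                        break
                nxt = listing.select("a.next")             # placeholder 'next page' link
                page_url = urljoin(page_url, nxt[0]["href"]) if nxt else None

    if __name__ == "__main__":
        crawl()

    Whether you crawl depth-first or breadth-first does not matter much here; what does matter is that every saved page can later be mapped back to its URL, so keep a record of the URL-to-filename mapping just as Wget does with its -o log.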

Task 2: Set up a Search Engine [60 points]

You will be using Apache Lucene for this task. In addition, Solr will be used as the web interface for search.

Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. (from wikipedia)
  1. Download Solr

    We have prepared a reduced and customized version of Solr at http://sifaka.cs.uiuc.edu/course/410s13/download/solr-4.1.0.tar.gz for you to check out. You can download it locally and scp it to your home folder on the server, or simply download it on the server by typing the command below:

    wget http://sifaka.cs.uiuc.edu/course/410s13/download/solr-4.1.0.tar.gz

    Uncompress this package after you download it (e.g., with tar -xzf solr-4.1.0.tar.gz)!

    If you downloaded the file before 2/14/2013, please follow the procedure below to make the document id show up on the search result page.

    1. Download the new richtext-doc.vm:
      
      wget http://sifaka.cs.uiuc.edu/course/410s13/download/richtext-doc.vm
      
    2. Replace richtext-doc.vm in ./solr-4.1.0/example/solr/cs410/conf/velocity/ with the new file:
      
      cp richtext-doc.vm $YOUR_SOLR_FOLDER/example/solr/cs410/conf/velocity/richtext-doc.vm
      
      For example, if $YOUR_SOLR_FOLDER is /home/zhou18/solr-4.1.0/, then run
      
      cp richtext-doc.vm /home/zhou18/solr-4.1.0/example/solr/cs410/conf/velocity/richtext-doc.vm
      
    3. Then your browser should show the document id below the snippet.
  2. Configure Solr

    Each student will be assigned a unique port number (in the range 8800 to 8950) on the Forum List Signup page. You need this port number to set up your search engine. Solr must be configured so that it knows which port to use:

    MODIFY "solr-4.1.0/example/etc/jetty.xml":

    Line 38: <Set name="port"><SystemProperty name="jetty.port" default="****"/></Set>

    Substitute **** with your own port number (each student is assigned a different port number).

    EXAMPLE:

    Line 38: <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
  3. Process Data

    In this step, you need to process the crawled data.

    Data Cleaning, Information Extraction and Data Formatting

    The crawled data needs to be parsed to extract the metadata (e.g. the title and URL of each page) as well as the main content. We provide a Python script as an example that parses crawled web content and converts it into the XML format expected by Solr. In general, this script works fine for most forums. However, it is by no means a "universal" parser (there is hardly a universal simple web parser available anyway), and you will probably need to modify this script, or you can write your own program if you like. A rough sketch of this kind of conversion is shown below.
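
    For illustration only, here is a rough Python sketch of the kind of conversion the provided script performs. The Solr field names (id, title, url, content), the document id pattern, and the wget log parsing are assumptions made for this sketch; check the provided script and the schema under solr-4.1.0/example/solr/cs410/conf/ for the fields your Solr instance actually expects, and adapt the content extraction to your forum's HTML.

    EXAMPLE SKETCH (Python):

    # parse_sketch.py -- illustrative HTML-to-Solr-XML converter (field names are assumptions)
    import glob
    import os
    from xml.sax.saxutils import escape

    from bs4 import BeautifulSoup

    NETID = "yournetid"     # used to build document ids such as yournetid_0, yournetid_1, ...
    HTML_DIR = "html"       # folder of crawled pages from Task 1
    URL_LOG = "wget_log"    # wget's -o log, used to map local files back to their URLs

    def load_url_map(log_path):
        """Rough sketch: pull URL -> "file" pairs out of the wget -nv log."""
        url_map = {}
        if os.path.exists(log_path):
            for line in open(log_path, encoding="utf-8", errors="ignore"):
                if "URL:" in line and "->" in line:
                    try:
                        url = line.split("URL:")[1].split()[0]
                        fname = line.split("->")[1].split('"')[1]
                        url_map[os.path.basename(fname)] = url
                    except IndexError:
                        continue
        return url_map

    def main():
        url_map = load_url_map(URL_LOG)
        docs = []
        doc_template = (
            "  <doc>\n"
            '    <field name="id">{}</field>\n'
            '    <field name="title">{}</field>\n'
            '    <field name="url">{}</field>\n'
            '    <field name="content">{}</field>\n'
            "  </doc>"
        )
        for i, path in enumerate(sorted(glob.glob(os.path.join(HTML_DIR, "*.html")))):
            with open(path, encoding="utf-8", errors="ignore") as f:
                soup = BeautifulSoup(f, "html.parser")
            title = soup.title.get_text(strip=True) if soup.title else os.path.basename(path)
            content = soup.get_text(" ", strip=True)   # crude main-content extraction
            url = url_map.get(os.path.basename(path), "")
            docs.append(doc_template.format("%s_%d" % (NETID, i), escape(title),
                                            escape(url), escape(content)))
        with open("%s.xml" % NETID, "w", encoding="utf-8") as out:
            out.write("<add>\n" + "\n".join(docs) + "\n</add>\n")

    if __name__ == "__main__":
        main()

    The NetID_number id pattern above matches the document ids shown in the Task 3 judgment file example; if the provided script already assigns ids, keep its convention instead.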

  4. Start Solr

    Before uploading your formatted XML file, you need to start Solr so that it can initialize and process the input data.

    TERMINAL COMMAND:

    
    cd solr-4.1.0/example/
    java -Dsolr.clustering.enabled=true -jar start.jar
    	

    This process must keep running the whole time for the search engine to remain available. Consider running the "screen" command first and then starting the Java process inside it:

    
    screen
    java -Dsolr.clustering.enabled=true -jar start.jar
    ctrl+A ctrl+D
    	

    When you want to reattach to this session and bring the process back:

    screen -r

    When you want to terminate the process:

    exit
  5. Post Data

    The formatted XML file now needs to be loaded into your Solr index, using the post.py script below.

    DOWNLOAD:

    wget http://sifaka.cs.uiuc.edu/course/410s13/download/post.py

    USAGE:

    python post.py [port] [xml_file]

    EXAMPLE:

    python post.py 8983 doom9.xml

    CHECK:

    Suppose we use port 8983. If the data is uploaded successfully, you should be able to see the upload information at http://cs410-server.cs.illinois.edu:8983/solr/#/cs410 (replace 8983 with your own port number).
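
    In case you are curious what the posting step involves, the sketch below sends the XML file to Solr's standard XML update handler by hand. It is NOT the provided post.py; it assumes the core is named cs410 (as in the configuration above) and that you run it on the server itself, so localhost plus your assigned port reaches your Solr instance.

    EXAMPLE SKETCH (Python):

    # post_sketch.py -- hand-rolled alternative to post.py (assumes core name "cs410",
    # run on the same machine as Solr so that localhost:<port> is reachable)
    import sys
    import urllib.request

    def post(port, xml_file):
        url = "http://localhost:%s/solr/cs410/update?commit=true" % port
        with open(xml_file, "rb") as f:
            data = f.read()
        req = urllib.request.Request(url, data=data,
                                     headers={"Content-Type": "text/xml; charset=utf-8"})
        with urllib.request.urlopen(req) as resp:
            print(resp.read().decode("utf-8"))    # Solr replies with a short XML status message

    if __name__ == "__main__":
        post(sys.argv[1], sys.argv[2])            # e.g. python post_sketch.py 8983 doom9.xml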

  6. Test Solr

    At this point, you should be able to access the search engine and test it. The search engine address has the format below:

    ADDRESS:

    HOSTNAME:PORT_NUMBER/solr/cs410/browse

    Suppose that

    
        HOSTNAME  = cs410-server.cs.illinois.edu
        PORT_NUMBER = 8983
        	

    EXAMPLE:

    http://cs410-server.cs.illinois.edu:8983/solr/cs410/browse

Task 3: Create Topics and Relevance Judgments [10 points]

Your task is to create a query (a question) and make relevance judgments on the results returned by your search engine for that query. We ask each of you to develop one query/topic and perform very preliminary relevance judgments. The purpose of the preliminary judgments is to ensure the quality of the query. For this purpose, you only need to judge the top 10 search results. Your chosen query should work reasonably well with your search engine, in the sense that there are at least some relevant documents among the top 10 results. Later, during the group search engine competition, we will ask you to judge more search results for your query. More specifically, please follow the instructions below:

We ask you to form queries from questions on Yahoo! Answers because
  1. they represent real information needs;
  2. they are relatively complex information needs that fit the scenario of forum search.

In case there are redundant or near-redundant documents that are relevant, please mark all of these redundant documents as relevant. (In real search applications, we often want to remove redundancy, but for this assignment, we ignore the issue.)

Here is an example of the judgment file, "query_xwang95.txt":

What is the best rustic camping site in illinois?
rustic camping site illinois
xwang95_0
xwang95_1
xwang95_2
...

What to turn in

Please pack all the following files into one single zip file or tar file and upload the package to Compass by midnight of the due date