The goal of this assignment is to set up a forum search engine. It is an individual assignment. You will be asked to crawl an Internet forum of interest to you, and then set up a search engine to support searching the crawled forum. You will be given an experimental search engine toolkit (Apache Lucene plus Solr), though you’ll need to configure it and process the data to actually make the search engine work appropriately. Finally, you’ll be asked to come up with a sample query with relevant documents in your crawled forum data, and make relevance judgments on a small number of top-ranked search results. Your queries and relevance judgments will be pooled together to be used in the next assignment, which is a group assignment of search engine competition.
In this homework, you will use the cs410 server provided by the department. Before you start your homework, make sure that you can access the cs410-server.cs.illinois.edu server through ssh with your NetID and password:

ssh YourNetID@cs410-server.cs.illinois.edu

Enter your password to log in. Please make sure you have access before you go on with your homework.
You may also need to remote copy files to the server. To do so, you can run the following command on your local machine (where your files reside):
scp YourLocalFile YourNetID@cs410-server.cs.illinois.edu:/home/cs/YourNetID/

or

scp -r YourLocalFolder YourNetID@cs410-server.cs.illinois.edu:/home/cs/YourNetID/

if you have a folder of files to upload. The file/folder will be placed under your home folder /home/cs/YourNetID. You can also specify a different destination, e.g. /home/cs/YourNetID/Temp/, as long as the folder exists on the server and you have permission to write to it.
You have two TAs in this course, and they are big fans of sports and recreation, respectively. They use forums a lot, and they are looking forward to having their own search engine for sports and recreation forums.
Big Boards has a great collection of different forums. You can find your choice of forum in the "Recreation" and "Sports" categories, or you can use any other forum as long as it is relevant to "recreation" or "sports". There are many subcategories for you to explore, such as "basketball" and "baseball" in the "Sports" category. When selecting your forum, bear in mind that it should contain ample information for at least one question you find on the question-answering website Yahoo! Answers. (You can choose the question freely --- many people, like your TAs, are fans of sports and recreation.) The question you choose will be used in Task 3, and you may want to read the description of that question before finalizing which forum you will crawl. To prevent duplicate work, please collaborate on maintaining the Forum List Signup page according to the instructions provided. Your choice of forum and question should NOT be the same as others listed on the page, so be the first one to claim YOUR forum.
There are two options for crawling the data. It is fairly simple to use Wget, but if you are interested, you are also encouraged to build your own crawler. Notice: since we will later combine all the data you crawl, no individual needs to crawl very much data. The data crawled from one forum should NOT exceed 30MB, which means in most cases you only need to crawl part of the forum. You can control the amount of data that you crawl by using an appropriate option of Wget (please see the descriptions of Wget's options below).
Option 1: Using Wget

Wget is a general-purpose crawler installed on almost all Linux distributions. Type "man wget" on a Linux machine to see how it works, or view the documentation here. Some of the important options are:
mkdir html
cd html
wget -nv -R gif,jpg,jpeg http://forum.doom9.org/ -r -nd --html-extension --random-wait -Q30m -o wget_log
cd ..
-A ".html, .htm". (You can also add extensions like ".php", ".jsp" in your accept list if you believe contents on those pages are useful)
'-nv', '-o logfile'should be used together with
'wget'in order to produce correct log file that will be used for generating XML file later.
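The log produced by '-nv -o wget_log' is what the XML-generation step later uses to map each saved file back to its original URL. As an illustration only (the exact log line format can vary across wget versions, and the sample line below is a made-up example), here is a sketch of extracting that mapping with a regular expression:

```python
import re

# Typical wget -nv log line (format may vary by wget version; this is an assumed example):
#   2013-02-01 10:30:00 URL:http://forum.doom9.org/index.html [5120/5120] -> "index.html" [1]
LOG_LINE = re.compile(r'URL:(\S+) \[[\d/]+\] -> "([^"]+)"')

def file_to_url(log_text):
    """Map each saved filename to the URL it was fetched from."""
    mapping = {}
    for line in log_text.splitlines():
        m = LOG_LINE.search(line)
        if m:
            url, filename = m.group(1), m.group(2)
            mapping[filename] = url
    return mapping

sample = '2013-02-01 10:30:00 URL:http://forum.doom9.org/index.html [5120/5120] -> "index.html" [1]'
print(file_to_url(sample))
```

If your wget version logs in a different format, adjust the regular expression accordingly.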
Option 2: Building your own crawler.
Using Wget to crawl forum data may cause several problems: it may download many duplicate pages, as well as useless pages and information such as ads; some long threads may span several pages, and Wget cannot recognize the connection between them. All of these problems can potentially hurt search performance on forum data. One solution is to build a specialized crawler for each forum site, so that we crawl only the information we are interested in.
Here are a few steps that you could follow to implement such a specialized crawler: 1) extract links for the different discussion boards on the forum; 2) extract links for the different topics/threads in each discussion board (you may need to follow the 'next' links to get all topics); 3) for each thread, extract all the posts (again, you may need to follow the 'next' links to get all posts). Depth-first search and breadth-first search are both feasible.
Tools: Python or Ruby may be more suitable for implementing the crawler than other programming languages. The BeautifulSoup library for Python and the Scrubyt library for Ruby make the pattern matching fairly simple. However, feel free to use other languages or tools in your implementation.
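To make step 1 above concrete, here is a minimal sketch of link extraction using only Python's standard-library html.parser (BeautifulSoup would make this shorter, but the idea is the same). The "showthread" URL pattern is a hypothetical example; every forum engine uses its own URL scheme, so adapt the filter to your forum:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_thread_links(html, base_url, marker="showthread"):
    # "showthread" is a made-up pattern; replace it with whatever marks
    # thread URLs on your forum (e.g. "viewtopic", "/threads/").
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return [u for u in parser.links if marker in u]

page = '<a href="showthread.php?t=42">Thread</a> <a href="/ads/banner">ad</a>'
print(extract_thread_links(page, "http://forum.example.org/"))
```

The same extractor can be reused for the 'next' links in steps 2 and 3 by filtering on a different marker.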
You will be using Apache Lucene for this task. In addition, Solr will be used as the web interface for search.
Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. (from wikipedia)
We have prepared a reduced and customized version of Solr at

http://sifaka.cs.uiuc.edu/course/410s13/download/solr-4.1.0.tar.gz

for you to check out. You can download it locally and use scp to upload it to your server home folder, or download it directly on the server. Then run the following command on the server:
cp richtext-doc.vm $YOUR_SOLR_FOLDER/example/solr/cs410/conf/velocity/richtext-doc.vm

For example, if your Solr folder is /home/zhou18/solr-4.1.0/, then run

cp richtext-doc.vm /home/zhou18/solr-4.1.0/example/solr/cs410/conf/velocity/richtext-doc.vm
Each student will be assigned a unique port number (ranging from 8800 to 8950) on the Forum List Signup page. You need the port number to set up your search engine. Solr needs to be configured so that it knows which port number to use:
Line 38: <Set name="port"><SystemProperty name="jetty.port" default="****"/></Set>

Substitute **** with your own port number (each student is assigned a different port number). For example, with the default port 8983 the line reads:

Line 38: <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
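If you prefer not to edit the configuration by hand, the substitution can also be scripted. This is just a sketch that rewrites the default="..." value of the jetty.port property shown above; it assumes the file contains exactly that placeholder line:

```python
import re

def set_jetty_port(config_text, port):
    """Replace the jetty.port default value with the assigned port number."""
    return re.sub(r'(<SystemProperty name="jetty.port" default=")[^"]*(")',
                  r'\g<1>%d\g<2>' % port, config_text)

line = '<Set name="port"><SystemProperty name="jetty.port" default="****"/></Set>'
print(set_jetty_port(line, 8800))
```

Editing the file manually works just as well; the point is only that the port lives in the default attribute on that line.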
At this step, you need to process the crawled data.
The crawled data needs to be parsed to extract the metadata (e.g., the title and URL of each page) as well as the main content. We provide a Python script as an example of parsing crawled web content and converting it to the XML format defined by Solr. In general, for most forums, this script works fine. However, it is by no means a "universal" parser (there is hardly any universal, simple web parser available anyway), so you may need to modify this script, or even write your own program if you like.
python gen_xml.py [NET_ID] [FORUM_NAME] [HTML_FOLDER] [WGET_LOG] [XML_FILE]
python gen_xml.py xwang95 doom9 ./html ./html/wget_log doom9.xml
Here the NetID (xwang95) is used to assign distinct document ids across all students. The forum name is specified as "doom9", the forum HTML files are stored in "./html", and the wget log is at "./html/wget_log". The output of the parser, an XML-format file, is written to "doom9.xml".
However, if your selected forum, "unfortunately", cannot be parsed with our provided script, you will need to write your own program to extract the title, content, etc. from the crawled data and convert them to an XML file.
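If you do end up writing your own converter, the core of it is just emitting Solr's add/doc XML with proper character escaping. Here is a minimal sketch using the standard-library xml.etree, with the same field names (id, cat, url, title, content) and the "netid_docid" id scheme described in this assignment; the input tuple layout is an assumption for illustration:

```python
import xml.etree.ElementTree as ET

def build_solr_add(netid, forum, pages):
    """pages: list of (url, title, content) tuples -> Solr <add> XML string."""
    add = ET.Element("add")
    for doc_id, (url, title, content) in enumerate(pages):
        doc = ET.SubElement(add, "doc")
        for name, value in [("id", "%s_%d" % (netid, doc_id)),
                            ("cat", forum),
                            ("url", url),
                            ("title", title),
                            ("content", content)]:
            field = ET.SubElement(doc, "field")
            field.set("name", name)
            field.text = value
    # ET handles escaping of &, <, > in titles and page content automatically.
    return ET.tostring(add, encoding="unicode")

xml = build_solr_add("xwang95", "doom9",
                     [("http://forum.doom9.org/", "Doom9 Forum", "Hello & welcome")])
print(xml)
```

Using an XML library rather than string concatenation matters here, because forum posts routinely contain &, <, and > characters that must be escaped for Solr to accept the file.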
To add documents to the index, we post an XML representation of the fields
to index to the update URL. The XML looks like the example below, with a
<field></field> element for each field to index. Such documents represent the
metadata and content of the actual documents or business objects that we're
indexing. Any data is indexable as long as it can be converted to this simple format:

<add>
  <doc>
    <field name="id">NET_ID_DOC_ID</field>
    <field name="cat">FORUM_NAME</field>
    <field name="url">URL</field>
    <field name="title">TITLE</field>
    <field name="content">WEB_PAGE_CONTENT</field>
  </doc>
  <doc>
    ...
  </doc>
</add>
The <add></add> element tells Solr that we want to add the document to the
index (or replace it if it's already indexed), and with the default configuration,
the id field is used as a unique identifier for the document. Posting another
document with the same id will overwrite existing fields and add new ones to the
indexed data. For our assignment, to avoid conflict, we ask each of you to use
"netid_docid" as the id field. The docid should start from 0,1,
2..., which gives id such as "xwang95_0", "xwang95_1", "xwang95_2".
For Further Reading: http://wiki.apache.org/solr/UpdateXmlMessages
Before uploading your formatted XML file, you need to start Solr so that it can process the input data.
cd solr-4.1.0/example/
java -Dsolr.clustering.enabled=true -jar start.jar
This process must keep running for the search engine to remain available. Consider using the "screen" command first and then starting the Java process:
screen
java -Dsolr.clustering.enabled=true -jar start.jar
ctrl+A ctrl+D
When you want to bring this process back, reattach to the screen session with "screen -r". When you want to terminate the process, reattach and press ctrl+C.
python post.py [port] [xml_file]
python post.py 8983 doom9.xml
Suppose we use port 8983: if the data is uploaded successfully, you should be able to see the upload information at http://cs410-server.cs.illinois.edu:8983/solr/#/cs410
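The provided post.py handles the upload for you, but for reference, sending documents to Solr's update handler is just an HTTP POST of the XML file. This sketch only builds the request; the /solr/update path shown is the standard Solr 4.x convention, and depending on your core configuration the path may instead be /solr/cs410/update:

```python
import urllib.request

def build_update_request(port, xml_bytes, host="localhost"):
    """Build the HTTP POST request that sends an <add> XML to Solr's update handler.

    commit=true asks Solr to make the new documents searchable immediately.
    """
    url = "http://%s:%d/solr/update?commit=true" % (host, port)
    req = urllib.request.Request(url, data=xml_bytes, method="POST")
    req.add_header("Content-Type", "application/xml")
    return req

req = build_update_request(8983, b"<add><doc>...</doc></add>")
print(req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

If the upload fails, check that the Solr process from the previous step is still running on your assigned port.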
At this point, you should be able to access the search engine and test it. The search engine's address is formatted as below:
HOSTNAME = cs410-server.cs.illinois.edu
PORT_NUMBER = 8983
Your task is to create a query (a question) and make relevance judgments on the results returned by your search engine for that query. We ask each of you to develop one query/topic and perform very preliminary relevance judgments. The purpose of the preliminary judgments is to ensure the quality of the query; for this purpose, you only need to judge the top 10 search results. Your chosen query should work reasonably well with your search engine, ensuring that there are at least some relevant documents in the top 10 results. Later, during the group search engine competition, we will ask you to judge more search results for your query. More specifically, please follow the instructions below:
Here is an example of the judgment file, "query_xwang95.txt":
What is the best rustic camping site in illinois?
rustic camping site illinois
xwang95_0
xwang95_1
xwang95_2
...
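Assuming the judgment file lists the question first, then the keyword query, then one judged document id per line (as in the example), a sketch of reading such a file for later processing:

```python
def parse_judgment_file(text):
    """Parse a judgment file: question, keyword query, then relevant doc ids."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    question, query, doc_ids = lines[0], lines[1], lines[2:]
    return question, query, doc_ids

sample = """What is the best rustic camping site in illinois?
rustic camping site illinois
xwang95_0
xwang95_1
xwang95_2
"""
question, query, doc_ids = parse_judgment_file(sample)
print(query, doc_ids)
```

A parser like this is handy when the queries and judgments are pooled for the group competition in the next assignment.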