Department of Computing

COMP348 Document Processing and the Semantic Web

Assignment 2 - Folksonomies

FAQ
Latest version: 14 May 2008
Version history:

  • 8 May 2008: Draft released
  • 13 May 2008: Assignment released
  • 14 May 2008: New, smaller and cleaner graph data

Background

Task

This task is about a real problem that affects a real company. Reed Business Information (http://www.reedbusiness.com.au) is a major Australian publisher that maintains hotfrog (http://www.hotfrog.com.au/), one of Australia's largest business directories.

Every business listing in hotfrog contains, among other things, a list of tags. These tags have been added by those who entered the business, creating a folksonomy of business tags. However, some of the tags are not informative for someone who wants to search the directory.

In particular, some tags are too vague, such as "business". Your task is to find a method to spot the vague tags. For this you will use an SVM classifier. You need to decide what features to use, implement a script that extracts the features, run the classifier, evaluate it, and report the results. There is a bonus section that gives you extra marks if you fine-tune the classifier and can show that the results improve after fine-tuning.

Resources Available

You have available the following resources:

  • An undirected graph linking business identifiers (you don't need to know the business names for this task) to the tags associated with them. The graph is provided as a list of edges. Here is an extract:
    "93751","Advertising Agencies"
    "93752","advertising"
    "93752","branded content"
    "93752","commercial"
    "93752","Commercials/corporate Production"
    "93752","Film Tv & Video Production Units"
    "93752","production"
    "93753","internet marketing"
    "93753","internet services"
    
    This extract lists three businesses. Business 93751 has one tag, business 93752 has six tags, and business 93753 has two tags.
    • Download the graph. This download is a zipped file of 13 MB. Once unzipped, the file is over 60 MB. If you are working at home over a slow connection you may prefer to download the file in the PC lab and transfer it to a USB memory stick.
  • An excerpt of annotated tags. This excerpt is a table with two columns. The first column is the tag name. The second column is 1 if the tag is annotated as vague. Here is a sample:
    "Accountants",
    "Accounting Services",
    "Accommodation",1
    "Restaurants",
    "Financial Services",1
    "Medical Specialists",1
    "Business Services",1
    "Car Repair & Service",
    
    According to this extract, "Accountants" is not vague and "Accommodation" is vague. For your convenience we have split the file into two parts, one with 5000 tags that you can use for training, and another with 1000 tags that you can use for testing.
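
As a minimal sketch of reading the two resources, assuming both files are plain CSV quoted as in the extracts above. The function names load_graph and load_annotations are illustrative and not part of the assignment; the sample data is copied from the extracts.

```python
import csv
import io
from collections import defaultdict


def load_graph(lines):
    """Map each business identifier to the set of tags attached to it."""
    tags_of = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) >= 2:  # skip blank or malformed lines
            business, tag = row[0], row[1]
            tags_of[business].add(tag)
    return dict(tags_of)


def load_annotations(lines):
    """Map each tag to True if its second column is 1 (annotated as vague)."""
    vague = {}
    for row in csv.reader(lines):
        if row:
            vague[row[0]] = len(row) > 1 and row[1].strip() == "1"
    return vague


# In-memory samples stand in for the real files (assumption: same layout).
graph_sample = io.StringIO(
    '"93751","Advertising Agencies"\n'
    '"93752","advertising"\n'
    '"93752","branded content"\n'
    '"93753","internet marketing"\n'
)
annotation_sample = io.StringIO('"Accountants",\n"Accommodation",1\n')

graph = load_graph(graph_sample)
labels = load_annotations(annotation_sample)
print(sorted(graph["93752"]))
print(labels)
```

With the real data, the same functions could be given an open file object instead of the in-memory samples.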

Features

You need to develop your own features, but here are some indications of the kinds of features you could try:

  • Number of businesses tagged with this tag. You could try absolute values or values relative to the maximum number of businesses.
  • Number of tags that share a business with this tag. Again this could be an absolute or a relative number.
  • Maximum conditional tagging probability. P(A|B) is the number of businesses tagged with both A and B, divided by the number of businesses tagged with B. The maximum conditional probability associated with tag B is the maximum of P(A|B) over all tags A. You can also try the average and minimum conditional tagging probability.
  • PageRank of the tag. You can build a graph that follows the conditional tagging probabilities and compute the PageRank of the vertices. Note that this could be very time-consuming to compute.
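
The first three kinds of feature could be computed along the following lines. This is only a sketch: the function name tag_features, the feature names, and the toy data are assumptions made for illustration, and the input is taken to be a mapping from business identifier to its set of tags. Enumerating all tag pairs per business is quadratic in the number of tags of that business, which may need care on the full graph.

```python
from collections import Counter, defaultdict
from itertools import permutations


def tag_features(tags_of):
    """Compute simple per-tag features from a business -> set-of-tags mapping."""
    business_count = Counter()     # number of businesses carrying each tag
    pair_count = Counter()         # businesses carrying both tags of (a, b)
    neighbours = defaultdict(set)  # tags sharing at least one business
    for tags in tags_of.values():
        business_count.update(tags)
        for a, b in permutations(tags, 2):
            pair_count[(a, b)] += 1
            neighbours[a].add(b)

    max_businesses = max(business_count.values())
    features = {}
    for b, n_b in business_count.items():
        # P(A|B) = count(A and B) / count(B) for every tag A co-occurring with B
        cond = [pair_count[(a, b)] / n_b for a in neighbours[b]]
        features[b] = {
            "businesses": n_b,
            "businesses_rel": n_b / max_businesses,
            "cooccurring_tags": len(neighbours[b]),
            "max_cond_prob": max(cond, default=0.0),
            "avg_cond_prob": sum(cond) / len(cond) if cond else 0.0,
            "min_cond_prob": min(cond, default=0.0),
        }
    return features


# Toy data (made up for illustration only).
toy = {
    "1": {"business", "plumber"},
    "2": {"business", "florist"},
    "3": {"business"},
}
feats = tag_features(toy)
print(feats["business"])
print(feats["plumber"])
```

In this toy graph the tag "business" appears on every listing but predicts no other tag well, while "plumber" always co-occurs with "business".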

Using the Classifier

In the practicals and the past assignment you have practised with SVM-Light. If you use it for this assignment you will probably find that, regardless of what features you choose, the final classifier simply marks all tags as non-vague. This outcome maximises the classifier's accuracy but it is useless for us. In particular, recall is 0 (no tags are marked as vague) and precision is undefined (0/0). In this assignment you will aim at optimising recall and precision.
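
A toy illustration of this point, with made-up labels (1 for vague, -1 for non-vague) and assuming the balanced F-score; the function name precision_recall_f is hypothetical. Undefined values are returned as None.

```python
def precision_recall_f(gold, predicted):
    """gold and predicted are sequences of labels: 1 for vague, -1 otherwise."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g != 1 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p != 1)
    # Precision is undefined (0/0) when nothing is marked as vague.
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None or precision + recall == 0:
        f_score = None
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Made-up gold labels: 2 vague tags out of 10.
gold = [1, -1, -1, -1, 1, -1, -1, -1, -1, -1]
all_negative = [-1] * len(gold)  # a classifier that marks everything non-vague

accuracy = sum(g == p for g, p in zip(gold, all_negative)) / len(gold)
result = precision_recall_f(gold, all_negative)
print(accuracy)  # 0.8
print(result)    # (None, 0.0, None)
```

The all-negative classifier reaches 80% accuracy on this toy data while finding none of the vague tags.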

Instead of SVM-Light, in this assignment you will use LIBSVM. This classifier has an option that trains the model for probability estimates and uses the model to compute the classification probability. The format of the feature file is exactly the same as for SVM-Light, and the usage is virtually the same:

  • To train the classifier and create a model you must type the following:
    svm-train -b 1 train.dat model
    
  • To classify unseen data type the following:
    svm-predict -b 1 test.dat model prediction
    

The option -b 1 instructs the program to use probability estimates. The first lines of the resulting prediction file look like this:

labels -1 1
-1 0.715342 0.284658
-1 0.787789 0.212211
-1 0.788941 0.211059
-1 0.787977 0.212023
-1 0.790036 0.209964
-1 0.790933 0.209067
-1 0.788154 0.211846
-1 0.785727 0.214273
-1 0.787875 0.212125

Each line indicates the classification outcome (-1 in all the lines above), the probability of class -1, and the probability of class 1. Since you now have the probabilities, you can override the decision of the classifier by changing the probability threshold in order to optimise recall and precision. For example, you could decide that all tags with a probability of class 1 over 0.25 should be classified as vague. Then the first instance of the above example (line 2 of the file) would be classified as vague.
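
A minimal sketch of applying such a threshold, using the first lines of the prediction file shown above. The function names are illustrative. The column holding the class 1 probability is read from the header line rather than assumed.

```python
def read_probabilities(lines):
    """Return the probability of class 1 for every instance in a prediction file."""
    rows = [line.split() for line in lines if line.strip()]
    header, instances = rows[0], rows[1:]
    # The header ("labels -1 1") gives the column order of the probabilities.
    column = header[1:].index("1") + 1
    return [float(row[column]) for row in instances]


def apply_threshold(probabilities, threshold):
    """Label an instance 1 (vague) when its class 1 probability exceeds the threshold."""
    return [1 if p > threshold else -1 for p in probabilities]


sample = """labels -1 1
-1 0.715342 0.284658
-1 0.787789 0.212211
-1 0.788941 0.211059
""".splitlines()

probs = read_probabilities(sample)
final = apply_threshold(probs, 0.25)
print(final)  # [1, -1, -1]
```

Repeating the last step for a range of thresholds gives the different final predictions whose recall and precision can then be compared.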

What You Need to Submit

Your classifier should be written in Python and call LIBSVM, and it should consist of at least the following files:

  • process.py. This should process the graph given the features you have chosen and produce the feature vectors for svm-train and svm-predict in LIBSVM format (which, incidentally, is the same format as SVM-Light). It should take as arguments the name of the file containing the graph information, the name of the file containing the annotated tags, and the name of the output file containing the feature vectors.
    process.py graph.csv annotation.csv features.dat	
    
  • learn.py. This should call process.py on the training corpus to produce a file train.dat of feature vectors as input for svm-train; it should then call svm-train to produce a model file. It should take as arguments the name of the file containing the graph information, the name of the file containing the annotated tags, the name of the file that contains the data to be sent to svm-train, and the name of the file storing the model produced by svm-train:
    learn.py graph.csv train.csv train.dat model.mdl
    
  • classify.py. This should call process.py on the test corpus to produce a file test.dat of feature vectors as input for svm-predict; it should then call svm-predict using the model file from the learner to produce a predictions file (which is the output file of svm-predict). Then it should use a predefined probability threshold (which you need to decide) to produce the final result file. It should take as arguments the name of the file containing the graph information, the name of the file that contains the correct annotations for the test data, the name of the file that contains the data to be sent to svm-predict, the name of the file that contains the model produced by svm-train, a filename for the predictions produced by svm-predict, and a filename for the final results:
    classify.py graph.csv test.csv test.dat model.mdl predictions.dat results.res
    
    The format of the file results.res is the same as predictions.dat, this time with the final predictions.
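
As a hedged sketch of the plumbing these scripts need, the following shows one way to write a feature vector in LIBSVM format and to build the svm-train and svm-predict command lines given earlier. It assumes Python 3 and that the LIBSVM executables are on the PATH; all function names are illustrative.

```python
import subprocess


def format_instance(label, values):
    """Format one instance as '<label> <index>:<value> ...' with 1-based indices."""
    pairs = " ".join(f"{i}:{v:g}" for i, v in enumerate(values, start=1))
    return f"{label} {pairs}"


def train_command(train_file, model_file):
    """Command line for svm-train with probability estimates enabled."""
    return ["svm-train", "-b", "1", train_file, model_file]


def predict_command(test_file, model_file, prediction_file):
    """Command line for svm-predict with probability estimates enabled."""
    return ["svm-predict", "-b", "1", test_file, model_file, prediction_file]


def run(command):
    """Run a command; raises CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)


line = format_instance(1, [0.5, 12, 0.25])
print(line)  # 1 1:0.5 2:12 3:0.25
print(" ".join(train_command("train.dat", "model.mdl")))
```

Calling run(train_command("train.dat", "model.mdl")) would then execute the training step once train.dat has been written.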

Deliverables

Report

You will submit an electronic copy of a report consisting of the following sections (no hardcopy is required):

The Problem
Basically summarise what these specifications say so that anybody can understand the goal of the assignment work.
The System
What kinds of features you used and how you decided on them, and also how to run the system.
The Results
Your SVM-based system's recall, precision and F-score on the test set. Include a table that shows how these evaluation values change as you change the probability threshold, and justify your choice of threshold.
The Conclusion

Your paper can broadly follow this sample report.

The report should be more polished than the one for Assignment 1 Part 3.

  • the report should have a title;
  • it should be free of spelling and typographical errors;
  • it should be in satisfactory English; and
  • it should have a decent layout (tables can be nice).

Code

You will also submit Python code as specified above (as separate files so that we can run your system), as well as relevant other files. Important note: These relevant other files must also include the model.dat file the SVM produces on the training data; it should be called model.dat.

Bonus Part

Optionally for some extra marks you can fine-tune the SVM classifier to optimise the results. Read this guide, which explains how you can determine optimal parameters for the classifier. In your written report include an additional section:

Fine Tuning
Explain what steps you followed to fine-tune the classifier parameters. Include charts, and motivate the choice of parameters.
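
One common way to fine-tune an SVM is a grid search over the cost parameter C and the kernel parameter gamma. The sketch below assumes that this is the kind of procedure intended; the specification itself does not say so. It only builds the svm-train command lines, using the LIBSVM options -c (cost), -g (gamma) and -v (n-fold cross-validation). The exponent ranges and function names are arbitrary choices for illustration.

```python
from itertools import product


def parameter_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs on a grid of powers of two."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exponents, gamma_exponents)]


def cross_validation_command(train_file, cost, gamma, folds=5):
    """svm-train command line running n-fold cross-validation for one (C, gamma) pair."""
    return ["svm-train", "-c", str(cost), "-g", str(gamma), "-v", str(folds), train_file]


# Coarse grid (an assumption, not a prescribed range).
grid = parameter_grid(range(-5, 16, 4), range(-15, 4, 4))
for cost, gamma in grid[:3]:
    print(" ".join(cross_validation_command("train.dat", cost, gamma)))
```

Note that cross-validation mode reports accuracy, which is not informative for this task, so you may prefer to compare parameter settings by recall and precision on held-out data instead.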

Assessment

This assignment is worth 15% plus 3% for the bonus part. The marking is broken down as follows:

  • [5.5 marks] quality of results (as well as general correctness, you'll get a higher mark for a higher accuracy rate);
  • [4.0 marks] quality of code;
  • [2.0 marks] correct calculation of your system's accuracy values and correct determination of the threshold values;
  • [3.5 marks] quality of report.


Comments to: Mark Dras or Diego Molla

Copyright Macquarie University
CRICOS provider no. 00002J