COMP348 Document Processing and the Semantic Web
Assignment 2 - Folksonomies
FAQ
Latest version: 14 May 2008
Version history:
- 8 May 2008: Draft released
- 13 May 2008: Assignment released
- 14 May 2008: New, smaller and cleaner graph data
Background
Task
This task is about a real problem that affects a real
company. Reed Business Information
(http://www.reedbusiness.com.au)
is a major Australian publisher that maintains hotfrog
(http://www.hotfrog.com.au/),
one of Australia's largest business directories.
Every business listing in hotfrog contains, among other things, a
list of tags. These tags have been added by those who entered the
business, creating a folksonomy of business tags. However, some of
the tags are not informative for someone who would want to search in
the directory.
In particular, some tags are too vague, such as "business". Your
task is to find a method to spot the vague tags. For this you will
use an SVM classifier. You need to decide what features to use,
implement a script that extracts the features, run the classifier,
evaluate it, and report the results. There will be a bonus section
that will give you extra marks if you fine-tune the classifier and
you can show that the results improve after fine-tuning.
Resources Available
You have available the following resources:
- An undirected graph of business identifiers (you don't need to know the
business names for this task) and the tags associated with each
business. The graph is provided as a list of edges. Here is an extract:
"93751","Advertising Agencies"
"93752","advertising"
"93752","branded content"
"93752","commercial"
"93752","Commercials/corporate Production"
"93752","Film Tv & Video Production Units"
"93752","production"
"93753","internet marketing"
"93753","internet services"
This extract lists three businesses. In this extract, business 93751 has
one tag, business 93752 has six tags, and business 93753 has two tags.
- Download the graph. This
download is a zipped file of 13 MB. Once unzipped the file is
over 60 MB. If you are working at home on a slow connection you
may prefer to download the file in the PC lab and transfer it to a
USB memory stick.
- An excerpt of annotated tags. This excerpt is a table with two
columns. The first column is the tag name. The second column is 1 if
the tag is annotated as vague. Here is a sample:
"Accountants",
"Accounting Services",
"Accommodation",1
"Restaurants",
"Financial Services",1
"Medical Specialists",1
"Business Services",1
"Car Repair & Service",
According to this extract, "Accountants" is not vague and
"Accommodation" is vague. For your convenience we have split the file
into two parts, one with 5000 tags that you can use for training, and
another with 1000 tags that you can use for testing.
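The two file formats described above can be read with Python's standard csv module. The following is a minimal sketch, not a required design: the function names load_graph and load_annotations are hypothetical, the samples are shortened versions of the extracts above, and mapping vague tags to label 1 and all other tags to label -1 is an assumption chosen to match the class labels used by the classifier later in this document.

```python
import csv
import io
from collections import defaultdict

def load_graph(lines):
    """Read "business_id","tag" rows into two lookup tables."""
    tags_of = defaultdict(set)        # business id -> set of tags
    businesses_of = defaultdict(set)  # tag -> set of business ids
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        business, tag = row[0], row[1]
        tags_of[business].add(tag)
        businesses_of[tag].add(business)
    return tags_of, businesses_of

def load_annotations(lines):
    """Read "tag",flag rows; a flag of 1 means vague, empty means not vague."""
    labels = {}
    for row in csv.reader(lines):
        if not row:
            continue
        flag = row[1].strip() if len(row) > 1 else ""
        labels[row[0]] = 1 if flag == "1" else -1  # assumed label convention
    return labels

# Shortened samples taken from the extracts above.
graph_sample = io.StringIO(
    '"93751","Advertising Agencies"\n'
    '"93752","advertising"\n'
    '"93752","production"\n'
)
annotation_sample = io.StringIO('"Accountants",\n"Accommodation",1\n')

tags_of, businesses_of = load_graph(graph_sample)
labels = load_annotations(annotation_sample)
print(sorted(tags_of["93752"]))
print(labels)
```

With the real files you would pass an open file object instead of the in-memory samples. Since the unzipped graph is over 60 MB, reading it row by row as above avoids holding the raw text in memory.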
Features
You need to develop your own features, but here are some
indications of the kinds of features you could try:
- Number of businesses tagged with this tag. You could try absolute
values or values relative to the maximum number of businesses.
- Number of tags that share a business with this tag. Again this could be an
absolute or a relative number.
- Maximum conditional tagging probability. P(A|B) is the number of
businesses tagged with both A and B, divided by the number of
businesses tagged with B. The maximum conditional probability
associated with tag B is the maximum of all P(A|B). You can also try
average and minimum conditional tagging probability.
- PageRank of the tag. You can build a graph that follows the
conditional tagging probabilities and compute the PageRank of the
vertices. Note that this could be very time-consuming to compute.
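As an illustration of the first three suggestions (PageRank is left out), the sketch below computes, for each tag B, the number of businesses tagged with B, the number of distinct tags that share a business with B, and the maximum of P(A|B) over all co-occurring tags A. The function name tag_features, the toy data and the choice of 0.0 for tags that never co-occur with another tag are assumptions made for this example only.

```python
from collections import defaultdict

def tag_features(tags_of):
    """Return {tag: [n_businesses, n_cooccurring_tags, max_conditional]}."""
    count = defaultdict(int)                      # businesses per tag
    pair = defaultdict(lambda: defaultdict(int))  # pair[b][a]: businesses with both
    for tags in tags_of.values():
        for b in tags:
            count[b] += 1
            for a in tags:
                if a != b:
                    pair[b][a] += 1
    features = {}
    for b, n in count.items():
        neighbours = pair[b]
        # P(A|B) = (businesses tagged with A and B) / (businesses tagged with B)
        max_cond = max(neighbours.values()) / n if neighbours else 0.0
        features[b] = [n, len(neighbours), max_cond]
    return features

# Toy graph: business id -> set of tags (hypothetical data).
toy = {"1": {"x", "business"}, "2": {"y", "business"}, "3": {"x"}}
features = tag_features(toy)
print(features["business"])
```

The inner loop is quadratic in the number of tags per business, which is acceptable when businesses carry only a handful of tags; whether that holds for the full graph is something to check on the real data.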
Using the Classifier
In the practicals and the past assignment you have practised with
SVM-Light. If you use it for this assignment you will probably find
out that, regardless of what features you choose, the final
classifier will simply mark all tags as non-vague. This is the
outcome that maximises the classifier accuracy but it is useless for
us. In particular, recall is 0 (no tags are marked as vague) and
precision is undefined (0/0). In this assignment you will aim at
optimising recall and precision.
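The degenerate case described above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function name evaluate is hypothetical, class 1 stands for vague, undefined values are reported as None, and the F-score is taken to be the balanced harmonic mean of precision and recall.

```python
def evaluate(gold, predicted, positive=1):
    """Return (precision, recall, f_score); None marks an undefined value."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else None  # 0/0 is undefined
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None or precision + recall == 0:
        f_score = None
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

gold = [1, -1, -1, 1, -1]
all_non_vague = [-1, -1, -1, -1, -1]  # classifier marks nothing as vague
mixed = [1, 1, -1, -1, -1]

print(evaluate(gold, all_non_vague))  # (None, 0.0, None)
print(evaluate(gold, mixed))          # (0.5, 0.5, 0.5)
```

The all-non-vague prediction is right on three of the five toy instances, yet it finds none of the vague tags, which is why accuracy alone is a poor guide here.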
Instead of SVM-Light, in this assignment you will
use LIBSVM. This
classifier has an option that trains the model for probability
estimates and uses the model to compute the classification
probability. The format of the feature file is exactly the same as
SVM-Light, and the usage is virtually the same:
- To train the classifier and create a model you must type the
following:
svm-train -b 1 train.dat model
- To classify unseen data, type the following:
svm-predict -b 1 test.dat model prediction
The option -b 1 instructs the program to use
probability estimates. The first lines of the resulting prediction
file look like this:
labels -1 1
-1 0.715342 0.284658
-1 0.787789 0.212211
-1 0.788941 0.211059
-1 0.787977 0.212023
-1 0.790036 0.209964
-1 0.790933 0.209067
-1 0.788154 0.211846
-1 0.785727 0.214273
-1 0.787875 0.212125
Each line indicates the classification outcome (-1 in all the lines
above), the probability of class -1, and the probability of class
1. Since you have the probabilities now you can override the
decision of the classifier by changing the probability threshold in
order to optimise recall and precision. For example, you could
decide that all tags with probability of class 1 over 0.25 should
be classified as vague. Then the first instance of the above
example (line 2) would be classified as vague.
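Overriding the classifier's decision in this way could look roughly like the sketch below. The function name apply_threshold is hypothetical. The code reads the header line to find which column holds the probability of class 1 rather than assuming it is always the last one, since the column order follows the "labels" line of the prediction file.

```python
def apply_threshold(prediction_lines, threshold=0.25):
    """Relabel each instance as 1 (vague) if P(class 1) exceeds the threshold."""
    rows = [line.split() for line in prediction_lines if line.strip()]
    header, body = rows[0], rows[1:]
    # The header looks like "labels -1 1" and gives the probability column order.
    column = header[1:].index("1") + 1
    return [1 if float(row[column]) > threshold else -1 for row in body]

# First lines of the prediction file shown above.
sample = [
    "labels -1 1",
    "-1 0.715342 0.284658",
    "-1 0.787789 0.212211",
    "-1 0.788941 0.211059",
]
decisions = apply_threshold(sample)
print(decisions)  # [1, -1, -1]
```

Calling the same function with several threshold values and scoring each result against the annotations is one way to see how recall and precision trade off.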
What You Need to Submit
Your classifier should be written in Python and call LIBSVM, and it
should consist of at least the following files:
- process.py. This should process the graph given
the features you have chosen and produce the feature vectors
for svm-train and svm-predict in
LIBSVM format (which, incidentally, is the same format as
SVM-Light). It should take as arguments the name of the file
containing the graph information, the name of the file
containing the annotated tags, and the name of the output file
containing the feature vectors.
process.py graph.csv annotation.csv features.dat
- learn.py. This should
call process.py on the training corpus to produce
a file train.dat of feature vectors as input
for svm-train; it should then
call svm-train to produce a model file. It
should take as arguments the name of the file containing the
graph information, the name of the file containing the
annotated tags, the name of the file that contains the data to
be sent to svm-train, and the name of the file
storing the model produced by svm-train:
learn.py graph.csv train.csv train.dat model.mdl
- classify.py. This should
call process.py on the test corpus to produce a
file test.dat of feature vectors as input
for svm-predict; it should then
call svm-predict using the model file from the
learner to produce a predictions file (which is the output file
of svm-predict). Then it should use a predefined
probability threshold (which you need to decide) to produce
the final result file. It should take as arguments the
name of the file which contains the test data of the graph,
the name of the file that contains the correct annotations, the
name of the file that contains the feature vectors to be sent
to svm-predict, the name of the file which contains the model
produced by svm-train, a filename for the predictions
produced by svm-predict, and a filename for the
final results:
classify.py graph.csv test.csv test.dat model.mdl predictions.dat results.res
The format of the file results.res is the same
as predictions.dat, this time with the final
predictions.
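The pieces that the three scripts share can be sketched as follows: writing one feature vector in the common SVM-Light/LIBSVM format (a label followed by index:value pairs with indices starting at 1) and building the svm-train and svm-predict command lines given earlier. All function names are hypothetical, and the sketch assumes the LIBSVM programs are on the PATH; it builds the commands without running them.

```python
import subprocess

def format_vector(label, values):
    """One line of the feature file: '<label> 1:<v1> 2:<v2> ...'."""
    pairs = " ".join(f"{i}:{v:g}" for i, v in enumerate(values, start=1))
    return f"{label:+d} {pairs}"

def train_command(train_file, model_file):
    # Mirrors: svm-train -b 1 train.dat model
    return ["svm-train", "-b", "1", train_file, model_file]

def predict_command(test_file, model_file, prediction_file):
    # Mirrors: svm-predict -b 1 test.dat model prediction
    return ["svm-predict", "-b", "1", test_file, model_file, prediction_file]

def run(command):
    """Run a LIBSVM program and raise an error if it exits with a failure."""
    subprocess.run(command, check=True)

line = format_vector(1, [2, 2, 0.5])
print(line)  # +1 1:2 2:2 3:0.5
print(train_command("train.dat", "model.mdl"))
```

In learn.py, for example, one might write every formatted line to train.dat and then call run(train_command("train.dat", "model.mdl")).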
Deliverables
Report
You will submit an electronic copy of a report consisting of the
following sections (no hardcopy is required):
- The Problem: Summarise what these specifications say so that
anybody can understand the goal of the assignment work.
- The System: What kinds of features you used and how you decided on
them, and also how to run the system.
- The Results: Your SVM-based system's recall, precision and F-score
on the test set. Include a table that shows how these evaluation
values change as you change the probability threshold and justify
your choice of threshold.
- The Conclusion
Your paper can broadly follow
this sample report.
The report should be more polished than the one for Assignment 1 Part 3:
- the report should have a title;
- it should be free of spelling and typographical errors;
- it should be in satisfactory English; and
- it should have a decent layout (tables can be nice).
Code
You will also submit Python code as specified above (as separate files
so that we can run your system), as well as relevant other files.
Important note: These relevant other files must also include
the model file the SVM produces on the training data;
it should be called model.dat.
Bonus Part
Optionally for some extra marks you can fine-tune the SVM
classifier to optimise the
results. Read this
guide, which explains how you can determine optimal parameters
for the classifier. In your written report include an additional
section:
- Fine Tuning: Explain what steps you followed to fine-tune the
classifier parameters. Include charts, and motivate the choice of
parameters.
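One common way to search for classifier parameters, which may or may not be the procedure the linked guide describes, is a grid search over the cost and kernel parameters using cross-validation. The sketch below only generates the command lines. It assumes svm-train's -c (cost), -g (gamma) and -v (n-fold cross-validation) options, and the exponent ranges and the function name grid_commands are illustrative choices rather than values taken from this assignment.

```python
from itertools import product

def grid_commands(train_file, c_exponents=range(-5, 16, 2),
                  g_exponents=range(-15, 4, 2), folds=5):
    """Build one cross-validation command per (C, gamma) pair on a log2 grid."""
    commands = []
    for c_exp, g_exp in product(c_exponents, g_exponents):
        cost, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        commands.append(["svm-train", "-c", repr(cost), "-g", repr(gamma),
                         "-v", str(folds), train_file])
    return commands

commands = grid_commands("train.dat")
print(len(commands))
print(" ".join(commands[0]))
```

Note that cross-validation as run by svm-train reports accuracy, so for this task you would still need to check recall and precision for the most promising parameter pairs.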
Assessment
This assignment is worth 15% plus 3% for the bonus part.
The marking is broken down as follows:
- [5.5 marks] quality of results (as well as general correctness,
you'll get a higher mark for a higher accuracy rate);
- [4.0 marks] quality of code;
- [2.0 marks] correct calculation of your system's accuracy
values and correct determination of the threshold values;
- [3.5 marks] quality of report.
Mark Dras or