COMP348 Document Processing and the Semantic Web
Assignment 1, Part 3
Latest version: 30 April 2008
Version history
FAQ
Feedback
Background
The background is the same as for Assignment 1, Part 1.
Task
Your task is to build a classifier based on SVM-light that assigns each blog in a corpus
to one of two classes, Young (13-17) or Old (33-47), based on the age of the blog poster.
If you haven't done them already, make sure you do the practical exercises for
week 6 and week 7 to get the hang of how SVM-light works.
Your classifier should be written in Python and call SVM-light (as in the week 6 exercises),
and it should consist of at least the following files:
- process.py. This should process the training or test corpus, using the features
  you have chosen, to produce feature vectors in SVM-light format for svm_learn
  and svm_classify. It should take as arguments the name of the directory
  containing the corpus and the name of the output file for the feature vectors:
  process.py corpus_dir features.dat
- learn.py. This should call process.py on the training corpus
  to produce a file train.dat of feature vectors as input for svm_learn; it
  should then call svm_learn to produce a model file.
  It should take as arguments the name of the directory containing the training
  corpus, a filename for the training data passed to svm_learn,
  and a filename for storing the model produced by svm_learn:
  learn.py training_dir train.dat model.dat
- classify.py. This should call process.py on the test corpus
  to produce a file test.dat of feature vectors as input for svm_classify; it
  should then call svm_classify, using the model file from the learner, to produce
  a predictions file and a results file (which contains the stdout output from
  svm_classify).
  It should take as arguments the name of the directory containing the test
  corpus, a filename for the test data passed to svm_classify,
  the filename of the model produced by svm_learn, a filename
  for the predictions produced by svm_classify, and a filename for the
  results produced by svm_classify:
  classify.py test_dir test.dat model.dat predictions.dat results.dat
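The following is a minimal sketch of the two mechanical pieces described above: writing one feature vector in SVM-light format, and calling svm_learn and svm_classify from Python. It is not a required design. The function names, the dictionary representation of features, and the assumption that the svm_learn and svm_classify executables are on the PATH are all illustrative, and the choice of features is left entirely to you.

```python
import subprocess


def format_example(label, features):
    """Return one line in SVM-light format: '<target> <id>:<value> ...'.

    label is +1 or -1; features maps positive integer feature ids to
    numeric values (a hypothetical representation chosen for this sketch).
    SVM-light expects feature ids in increasing order, so they are sorted.
    Zero-valued features are omitted, since the format is sparse.
    """
    pairs = " ".join(
        f"{fid}:{value:g}"
        for fid, value in sorted(features.items())
        if value != 0
    )
    return f"{label:+d} {pairs}".rstrip()


def run_svm_learn(train_file, model_file, executable="svm_learn"):
    """Call svm_learn with default options (assumes it is on the PATH)."""
    subprocess.run([executable, train_file, model_file], check=True)


def run_svm_classify(test_file, model_file, predictions_file, results_file,
                     executable="svm_classify"):
    """Call svm_classify and save its stdout output to results_file."""
    with open(results_file, "w") as results:
        subprocess.run(
            [executable, test_file, model_file, predictions_file],
            stdout=results,
            check=True,
        )
```

For example, format_example(1, {3: 0.5, 1: 2}) gives the line "+1 1:2 3:0.5". Passing the command as a list to subprocess.run avoids shell quoting problems with file names.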
Data
The data comes from the Blog Authorship Corpus. The corpus consists of a number of
files, with names of the form ID.gender.age.job.starsign.xml; for example,
109656.male.36.LawEnforcement-Security.Pisces.xml.
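Because the age is encoded in the file name, the class label for each file can be read off the name itself. The sketch below shows one way this might be done; the function names and the mapping of Young to +1 and Old to -1 are assumptions made for illustration (SVM-light only requires that the two classes use the targets +1 and -1).

```python
import os

YOUNG_AGES = range(13, 18)  # 13-17 inclusive
OLD_AGES = range(33, 48)    # 33-47 inclusive


def parse_blog_filename(path):
    """Split a name of the form ID.gender.age.job.starsign.xml into fields."""
    name = os.path.basename(path)
    if not name.endswith(".xml"):
        raise ValueError(f"not an .xml file: {name}")
    parts = name[:-len(".xml")].split(".")
    if len(parts) != 5:
        raise ValueError(f"unexpected file name: {name}")
    blog_id, gender, age, job, starsign = parts
    return {
        "id": blog_id,
        "gender": gender,
        "age": int(age),
        "job": job,
        "starsign": starsign,
    }


def age_label(age):
    """Map an age to an SVM-light target (assumed: Young = +1, Old = -1)."""
    if age in YOUNG_AGES:
        return 1
    if age in OLD_AGES:
        return -1
    raise ValueError(f"age {age} is in neither class")


info = parse_blog_filename("109656.male.36.LawEnforcement-Security.Pisces.xml")
label = age_label(info["age"])  # 36 falls in the Old range
```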
Here is a sample of 20 files from each of the
Young and Old categories;
use this as your initial "training" data (i.e. the data you use to come up with
your features).
Here are larger data sets. The *_train data sets contain 1000 files each;
the *_test data sets contain 200 files each. Use the *_train data sets to
develop and train your classifier, and the *_test data sets to test how well
it generalises.
Deliverables
Report
You will submit a hardcopy and an electronic copy of a report consisting
of the following sections:
- the problem (see Background above; you can reuse the problem explanation from Part 1);
- the system: what kinds of features you used, how you decided on them, and how to run the system;
- the results:
  - your SVM-based system's accuracy rate (on both training and test sets),
  - a baseline accuracy rate, and
  - as a second point of comparison, the accuracy rate of your system from Part 2
    (the rule-based system);
  and in addition,
  - a note of the statistical test used to determine whether there is a statistically significant
    difference between your SVM system and the baseline, and the result of this test, and
  - a note of the statistical test used to determine whether there is a statistically significant
    difference between your SVM system and your rule-based system, and the result of this test; and
- the conclusion.
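As an illustration of the calculations the results section calls for, here is a minimal sketch that computes an accuracy rate and compares two systems evaluated on the same test files. The choice of test is yours to make and justify; an exact McNemar (sign) test is used here only as one possible option for paired predictions, not as the prescribed one. The function names and the toy data are hypothetical. The sketch relies on the fact that svm_classify writes one real-valued score per line of the predictions file, whose sign gives the predicted class.

```python
from math import comb


def scores_to_labels(scores):
    """Turn svm_classify scores into class labels by taking their sign."""
    return [1 if score > 0 else -1 for score in scores]


def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two systems on the same items.

    Only items where exactly one system is correct are informative; under
    the null hypothesis each such item favours either system with
    probability 0.5, so the smaller count follows Binomial(n, 0.5).
    """
    a_only = sum(a == g and b != g for g, a, b in zip(gold, pred_a, pred_b))
    b_only = sum(b == g and a != g for g, a, b in zip(gold, pred_a, pred_b))
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data for illustration only; these are not real results.
gold = [1, 1, 1, 1, -1, -1, -1, -1]
system = scores_to_labels([0.9, 0.2, 1.3, 0.4, -0.7, -1.1, -0.3, -0.5])
baseline = [-1] * len(gold)  # a baseline that always predicts one class

p_value = mcnemar_exact(gold, system, baseline)
```

The same mcnemar_exact function can be applied to the SVM system and the rule-based system, provided both sets of predictions are for the same test files in the same order.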
Your paper can broadly follow this sample report.
Note that it should be longer than in Part 1, because you'll have to explain your
features.
The report should be more polished than the one for Part 1. It should look like an actual report (albeit
a shorter one) of the sort you will have to write in the workplace:
- the report should have a title;
- it should be free of spelling and typographical errors;
- it should be in satisfactory English; and
- it should have a decent layout (tables can be nice).
Code
You will also submit Python code as specified above (both electronic and hardcopy versions), as
well as any other relevant files.
Important note: These other files must include the model file that the SVM produces
on the training data, and it must be called model.dat.
Assessment
This part of Assignment 1 is worth 12%.
The marking is broken down as follows:
- [4.5 marks] quality of results (as well as general correctness,
  you'll get a higher mark for a higher accuracy rate);
- [3.0 marks] quality of code;
- [2.0 marks] correct calculation of your system's accuracy rate
  and determination of statistical significance;
- [2.5 marks] quality of report.
Mark Dras or