
Department of Computing


COMP348 Document Processing and the Semantic Web

Assignment 1, Part 3

Latest version: 30 April 2008

Background

The background is the same as for Assignment 1, Part 1.

Task

Your task is to build a classifier based on SVM-light that assigns each blog post in a corpus to one of two classes, Young (13-17) or Old (33-47), based on the age of the blog poster.

If you haven't done them already, make sure you do the practical exercises for week 6 and week 7 to get the hang of how SVM-light works.

Your classifier should be written in Python and call SVM-light (as in the week 6 exercises), and it should consist of at least the following files:

  • process.py. This should process the training or test corpus, using the features you have chosen, to produce feature vectors in SVM-light format for svm_learn and svm_classify. It should take as arguments the name of the directory containing the corpus, and the name of the output file for the feature vectors:
    process.py corpus_dir features.dat
    
  • learn.py. This should call process.py on the training corpus to produce a file train.dat of feature vectors as input for svm_learn; it should then call svm_learn to produce a model file. It should take as arguments a directory name which is the location of the training corpus, a filename which contains the training data for svm_learn, and a filename for storing the model produced by svm_learn:
    learn.py training_dir train.dat model.dat
    
  • classify.py. This should call process.py on the test corpus to produce a file test.dat of feature vectors as input for svm_classify; it should then call svm_classify using the model file from the learner to produce a predictions file and a results file (which contains the stdout output from svm_classify). It should take as arguments a directory name which is the location of the test corpus, a filename which contains the test data for svm_classify, a filename which contains the model produced by svm_learn, a filename for the predictions produced by svm_classify, and a filename for the results produced by svm_classify:
    classify.py test_dir test.dat model.dat predictions.dat results.dat
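
As a rough illustration of the pipeline described above, the following sketch shows one way to write a feature vector in SVM-light format and to call the two SVM-light programs from Python. It assumes Python 3, that svm_learn and svm_classify are on your PATH, and that features are held in a dictionary from positive integer index to value; the function names are hypothetical and are not part of the assignment specification.

```python
import subprocess


def to_svmlight_line(label, features):
    """Format one example as an SVM-light line: '<target> <index>:<value> ...'.

    label is +1 or -1; features maps a positive integer index to a number
    (this dictionary representation is an assumption of the sketch).
    """
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    # SVM-light expects feature indices in increasing order;
    # features whose value is zero can simply be left out.
    pairs = ["%d:%g" % (i, v) for i, v in sorted(features.items()) if v != 0]
    return " ".join(["%+d" % label] + pairs)


def run_svm_learn(train_file, model_file):
    # Command form: svm_learn [options] example_file model_file
    subprocess.run(["svm_learn", train_file, model_file], check=True)


def run_svm_classify(test_file, model_file, predictions_file, results_file):
    # Command form: svm_classify [options] example_file model_file output_file
    # The stdout of svm_classify is saved as the results file.
    with open(results_file, "w") as out:
        subprocess.run(
            ["svm_classify", test_file, model_file, predictions_file],
            stdout=out,
            check=True,
        )
```

For example, to_svmlight_line(1, {3: 0.5, 1: 2}) gives the line "+1 1:2 3:0.5". Note that %g keeps about six significant digits, which may or may not suit your features.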
    

Data

The data comes from the Blog Authorship Corpus. The corpus consists of a number of files, with names of the form ID.gender.age.job.starsign.xml; for example, 109656.male.36.LawEnforcement-Security.Pisces.xml.
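
Since the age is part of each file name, the class label can be read directly from it. The sketch below is one possible way to do this; the function name, and the choice of +1 for Young and -1 for Old, are assumptions made for illustration only.

```python
import os


def label_from_filename(path):
    """Return +1 for Young (13-17) or -1 for Old (33-47).

    Expects names of the form ID.gender.age.job.starsign.xml.
    The +1/-1 assignment is an arbitrary choice for this sketch.
    """
    parts = os.path.basename(path).split(".")
    if len(parts) < 6 or parts[-1] != "xml":
        raise ValueError("unexpected file name: %r" % path)
    age = int(parts[2])  # third dot-separated field is the age
    if 13 <= age <= 17:
        return 1
    if 33 <= age <= 47:
        return -1
    raise ValueError("age %d is outside both classes" % age)
```

With the example name above, label_from_filename("109656.male.36.LawEnforcement-Security.Pisces.xml") returns -1.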

Here is a sample of 20 files from each of the Young and Old categories; use this as your initial "training" data (i.e. the data you use to come up with your features).

Here are larger data sets. The *_train data sets contain 1000 files each; the *_test data sets contain 200 files each. You should develop and train your system on the *_train data sets, and use the *_test data sets to test its generality.

Deliverables

Report

You will submit a hardcopy and an electronic copy of a report consisting of the following sections:

  1. the problem (see Background above -- you can reuse the problem explanation from Part 1);
  2. the system -- what kinds of features you used, and how you decided on them, and also how to run the system;
  3. the results:
    • your SVM-based system's accuracy rate (on both training and test sets),
    • a baseline accuracy rate, and
    • as a second point of comparison, the accuracy rate from your system from Part 2 (the rule-based system);
    and in addition,
    • a note of the statistical test used to determine whether there is a statistically significant difference between your SVM system and the baseline, and the result of this test, and
    • a note of the statistical test used to determine whether there is a statistically significant difference between your SVM system and your rule-based system, and the result of this test; and
  4. the conclusion.
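
The assignment does not prescribe a particular baseline or statistical test. As a minimal sketch under stated assumptions, the code below computes accuracy, a majority-class baseline (one common choice of baseline), and a two-sided exact sign test on paired predictions (one common way to compare two classifiers on the same test items). It assumes Python 3.8 or later, labels encoded as +1 and -1, and hypothetical function names. Recall that svm_classify writes one real-valued score per line, and that the sign of the score is the predicted class.

```python
from math import comb


def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def majority_baseline_accuracy(gold):
    """Accuracy of always predicting the more frequent class."""
    return max(gold.count(1), gold.count(-1)) / len(gold)


def sign_test_p_value(gold, pred_a, pred_b):
    """Two-sided exact sign test on the items where only one system is right."""
    a_only = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
    b_only = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if b == g and a != g)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    k = min(a_only, b_only)
    # Probability of a split at least this uneven under a fair coin, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

To compare against a baseline with this test, treat the baseline's constant predictions as the second system. Other tests may be equally appropriate; whichever you use, name it in the report.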

Your report can broadly follow this sample report. Note that it should be longer than in Part 1, because you'll have to explain your features.

The report should be more polished than the one for Part 1. It should look like an actual report (albeit shorter) of the sort you'll have to write when you go and work:

  • the report should have a title;
  • it should be free of spelling and typographical errors;
  • it should be in satisfactory English; and
  • it should have a decent layout (tables can be nice).

Code

You will also submit Python code as specified above (both electronic and hardcopy versions), as well as other relevant files. Important note: These other files must include the model file that SVM-light produces on the training data; it must be called model.dat.

Assessment

This part of Assignment 1 is worth 12%. The marking is broken down as follows:

  • [4.5 marks] quality of results (as well as general correctness, you'll get a higher mark for a higher accuracy rate);
  • [3.0 marks] quality of code;
  • [2.0 marks] correct calculation of your system's accuracy rate and determination of statistical significance;
  • [2.5 marks] quality of report.


Comments to: Mark Dras or Diego Molla

Copyright Macquarie University
CRICOS provider no. 00002J