Department of Computing

COMP348 Document Processing and the Semantic Web

Assignment 1, Part 2

Latest version: 10 April 2008

Background

The background is the same as for Assignment 1, Part 1.

Task

Your task is to build a rule-based classifier that assigns each blog in a corpus to one of two classes, Young (13-17) or Old (33-47), based on the age of the blog poster. (Part 3 will use a machine learning approach to constructing a classifier.)

For example, your rules might look something like:

IF blog_post uses "lol" THEN category is Young

IF number of occurrences of urlLink in blog_post > 5 THEN category is Old
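
The two example rules can be written as small Python functions. This is only a minimal sketch, not a recommended rule set: the function names and the order in which the rules are tried are assumptions, the threshold of 5 is simply copied from the example, and matching "lol" as a whole word regardless of case is just one possible reading of "uses".

```python
import re

# Threshold copied from the example rule; a real value should come from the training data.
URL_LINK_THRESHOLD = 5


def uses_lol(text):
    """True if "lol" occurs as a standalone word (case-insensitive)."""
    return re.search(r"\blol\b", text, flags=re.IGNORECASE) is not None


def count_url_links(text):
    """Number of times the token urlLink occurs in the post."""
    return text.count("urlLink")


def apply_rules(text):
    """Try the example rules in order; return "Young", "Old", or None if no rule fires."""
    # Assumption: the "lol" rule takes precedence when both rules would fire.
    if uses_lol(text):
        return "Young"
    if count_url_links(text) > URL_LINK_THRESHOLD:
        return "Old"
    return None
```

Because a post may match no rule at all, a complete classifier also needs a default category for the None case.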

Your classifier should be written in Python, and should consist of at least the following two files:

  • classify_individual.py. This should take a single filename as an argument, and classify the file as Young or Old using only the file contents (i.e. not the filename). If Young, it should output +1; if Old, it should output -1.
  • classify_corpus.py. This should iterate classify_individual.py over the whole set of files, which it should expect to find in the current directory. By comparing the outputs of classify_individual.py with the information contained in the filenames, it should determine the accuracy of the classifier.
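
One possible skeleton for classify_individual.py is sketched below, assuming Python 3. The function names, the single placeholder rule, the default of Old when the rule does not fire, and the UTF-8 decoding are all assumptions made for illustration; only the single filename argument and the +1 / -1 output come from the specification above.

```python
"""classify_individual.py (sketch): print +1 for Young, -1 for Old."""
import sys

YOUNG, OLD = 1, -1


def classify_text(text):
    """Placeholder rule set: a single "lol" substring rule, otherwise Old."""
    # Assumption: falling back to Old when no rule fires is an arbitrary default.
    return YOUNG if "lol" in text.lower() else OLD


def classify_file(path):
    """Classify one file from its contents only; the filename is never inspected."""
    # Assumption: the files decode as UTF-8; undecodable bytes are replaced, not fatal.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return classify_text(handle.read())


def format_label(label):
    """Render the label with an explicit sign, i.e. "+1" or "-1"."""
    return f"{label:+d}"


# Runs only when the script is given exactly one filename argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    print(format_label(classify_file(sys.argv[1])))
```

Keeping the decision logic in classify_text, separate from the file handling, lets classify_corpus.py either import the function or run the script as a subprocess and read its output.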

Data

The data comes from the Blog Authorship Corpus. The corpus consists of a number of files, with names of the form ID.gender.age.job.starsign.xml; for example, 109656.male.36.LawEnforcement-Security.Pisces.xml.
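
Given this naming scheme, the true class of a file can be recovered from its name and compared with a classifier's output. The sketch below shows one way to do that, assuming Python 3; the function names are hypothetical, and it assumes the age is always the third dot-separated field, which is inferred from the single example filename above.

```python
import os


def age_from_filename(filename):
    """Return the age field from a name of the form ID.gender.age.job.starsign.xml."""
    # Assumption: no field before the age contains a dot.
    return int(os.path.basename(filename).split(".")[2])


def gold_label(filename):
    """Map the age to +1 (Young, 13-17) or -1 (Old, 33-47); None for any other age."""
    age = age_from_filename(filename)
    if 13 <= age <= 17:
        return 1
    if 33 <= age <= 47:
        return -1
    return None


def accuracy(predictions, gold):
    """Fraction of items whose predicted label equals the gold label."""
    if not gold:
        raise ValueError("no labelled files to evaluate")
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)


def evaluate(directory, classify_file):
    """Accuracy of classify_file (a callable returning +1 or -1) over the .xml files in directory."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(".xml"))
    # Skip any file whose age falls outside both classes.
    names = [n for n in names if gold_label(n) is not None]
    gold = [gold_label(n) for n in names]
    predictions = [classify_file(os.path.join(directory, n)) for n in names]
    return accuracy(predictions, gold)
```

For the example filename, age_from_filename gives 36, which falls in the Old range, so the gold label is -1.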

Here is a sample of 20 files from each of the Young and Old categories; use this as your initial "training" data (i.e. the data you use to come up with your rules).

Here are larger data sets. The *_train data sets contain 1000 files each; the *_test data sets contain 200 files each. You might want to use the *_train data sets as the ones you develop your rules from, and the *_test ones as the ones you test their generality on.

Benchmark

To give you an idea of what accuracy should be achievable, and of a benchmark I'll be using for assigning the "quality of results" part of the assessment, I've implemented a simple algorithm. Evaluated on the test data, its accuracy is 255 / 400 (63.75%).

Deliverables

Report

You will submit a hardcopy and an electronic copy of a report consisting of the following sections:

  1. the problem (see Background above -- you can reuse the problem explanation from Part 1);
  2. the system -- what kinds of rules you used, and how you decided on them, and also how to run the system;
  3. the results -- your system's accuracy rate (on both training and test sets), a baseline accuracy rate, a note of the statistical test used to determine whether there is a statistically significant difference between these rates, and the result of this test; and
  4. the conclusion.
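
For the significance test in section 3, one option among several is an exact one-sided binomial test of the system's number of correct classifications against a fixed baseline rate. The sketch below assumes Python 3 and a baseline of 50%, which is what always guessing one class would score on a data set with equal numbers of Young and Old files; the function name is hypothetical, and a different test may be more appropriate for your report.

```python
from math import exp, lgamma, log


def binomial_p_value(correct, total, baseline_rate=0.5):
    """One-sided exact binomial test.

    Returns P(X >= correct) for X ~ Binomial(total, baseline_rate), i.e. the
    chance of doing at least this well if each file were classified correctly
    with probability baseline_rate.
    """
    if not 0.0 < baseline_rate < 1.0:
        raise ValueError("baseline_rate must be strictly between 0 and 1")

    def log_pmf(k):
        # Work in log space so large corpora do not overflow a float.
        log_choose = lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
        return log_choose + k * log(baseline_rate) + (total - k) * log(1.0 - baseline_rate)

    tail = sum(exp(log_pmf(k)) for k in range(correct, total + 1))
    return min(1.0, tail)
```

As a usage example, binomial_p_value(255, 400) applies the test to the benchmark figure of 255 correct out of 400; a result below the chosen significance level (commonly 0.05) would indicate that the difference from the 50% baseline is unlikely to be due to chance.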

Your paper can broadly follow this sample report. Note that it should be longer than in Part 1, because you'll have to explain your features and rules.

The report should be more polished than the one for Part 1. It should look like an actual report (albeit shorter) of the sort you'll have to write in the workplace:

  • the report should have a title;
  • it should be free of spelling and typographical errors;
  • it should be in satisfactory English; and
  • it should have a decent layout (tables can be nice).

Here are a couple of nice example reports from Part 1: here and here.

Code

You will also submit Python code as specified above (both electronic and hardcopy versions) as well as any other relevant files. These other files might include other Python programs called by the required ones, or data files (e.g. with a list of features) used by them.

Assessment

This part of Assignment 1 is worth 8%. The marking is broken down as follows:

  • [4.5 marks] quality of results (general correctness, plus a higher mark for a higher accuracy rate);
  • [1.0 marks] quality of code;
  • [1.0 marks] correct calculation of your system's accuracy rate and determination of statistical significance;
  • [1.5 marks] quality of report.


Comments to: Mark Dras or Diego Molla

Copyright Macquarie University
CRICOS provider no. 00002J