Department of Computing

COMP348 Document Processing and the Semantic Web

Assignment 1, Part 2: Benchmark

I've written a simple algorithm, sketched below. It's not meant to be the best at the task, but has two purposes:

  1. It will give you an idea of what kind of improvement you could expect (if you'd been wondering if your system should be getting 51% or 99%, for example).

  2. I'll use it when assigning marks: you'll get higher marks on the "quality of results" criterion if you score better than this algorithm.

It's also not meant to be (necessarily) the kind of algorithm you use. I don't mind if you adapt it, particularly if you come up with something better, but I won't be answering questions on it (of the sort "If I want to implement this, how do I do step X?").

The algorithm has the following steps:

  1. Determine the features for classify_individual.py to look at:

    1. Tokenise the training files on whitespace.

    2. Count all of the words (ignoring XML) for Young and Old.

    3. Take the top N for each of Young and Old.

    4. Find the set differences (i.e. the elements of top N for Young that are not in the top N for Old, and vice versa). Call these OldFeatures and YoungFeatures.

  2. In classify_individual.py:

    1. Tokenise the test files on whitespace.

    2. Total up the number of occurrences of OldFeatures and of YoungFeatures. Call these OldCount and YoungCount.

    3. Assign the file according to whether OldCount or YoungCount is greater. If they are equal, assign randomly.

  3. In classify_corpus.py:

    1. Run classify_individual.py over the whole directory.

    2. Compare the output with the age indicator in each filename.

    3. Calculate accuracy, i.e. the fraction of files classified correctly.
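The steps above can be sketched roughly as follows. This is only an illustrative outline, not the actual benchmark code: the function names, the regular expression used to "ignore XML" (here, simply dropping anything that looks like a tag), the label strings "young" and "old", and passing texts in as strings rather than reading files are all assumptions. How the age indicator is read from a filename is left out, so the accuracy function takes the correct labels directly.

```python
import random
import re
from collections import Counter

# Assumption: "ignoring XML" is taken to mean dropping anything shaped like a tag.
TAG = re.compile(r"<[^>]+>")


def tokenise(text):
    """Replace XML tags with spaces, then split on whitespace."""
    return TAG.sub(" ", text).split()


def top_n(texts, n):
    """Return the set of the n most frequent tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenise(text))
    return {word for word, _ in counts.most_common(n)}


def select_features(young_texts, old_texts, n):
    """Return (YoungFeatures, OldFeatures) as the two set differences."""
    top_young = top_n(young_texts, n)
    top_old = top_n(old_texts, n)
    return top_young - top_old, top_old - top_young


def classify(text, young_features, old_features, rng=random):
    """Label a text by comparing YoungCount with OldCount; ties are random."""
    tokens = tokenise(text)
    young_count = sum(1 for t in tokens if t in young_features)
    old_count = sum(1 for t in tokens if t in old_features)
    if young_count > old_count:
        return "young"
    if old_count > young_count:
        return "old"
    return rng.choice(["young", "old"])


def accuracy(labelled_texts, young_features, old_features):
    """Fraction of (gold_label, text) pairs that are classified correctly."""
    pairs = list(labelled_texts)
    correct = sum(
        classify(text, young_features, old_features) == gold
        for gold, text in pairs
    )
    return correct / len(pairs)
```

N is left as a parameter because the description does not say which value was used. Since ties are broken randomly, accuracy can vary slightly from run to run on the same data.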


Comments to: Mark Dras or Diego Molla

Copyright Macquarie University
CRICOS provider no. 00002J