COMP348 Document Processing and the Semantic Web
Assignment 1, Part 2: Benchmark
I've written a simple algorithm, sketched below. It's not meant to be the best at the task, but has two
purposes:
- It will give you an idea of what kind of improvement you could expect (if you'd been wondering whether
  your system should be getting 51% or 99%, for example).
- I'll use it when assigning marks: you'll get higher marks on the "quality of results" criterion if you
  score better than this algorithm.
It's also not meant to be (necessarily) the kind of algorithm you use. I don't mind if you adapt it,
particularly if you come up with something better, but I won't be answering questions on it (of the sort
"If I want to implement this, how do I do step X?").
The algorithm has the following steps:
- Determine features to look at in classify_individual.py:
  - Tokenise the training files on whitespace.
  - Count all of the words (ignoring XML) for Young and Old.
  - Take the top N for each of Young and Old.
  - Find the set differences (i.e. the elements of top N for Young that are not in the
    top N for Old, and vice versa). Call these OldFeatures and YoungFeatures.
- In classify_individual.py:
  - Tokenise the test files on whitespace.
  - Total up the number of occurrences of OldFeatures and of YoungFeatures. Call
    these OldCount and YoungCount.
  - Assign the file according to whether OldCount or YoungCount is greater. If they are
    equal, assign randomly.
- In classify_corpus.py:
  - Run classify_individual.py over the whole directory.
  - Compare output with filename age indicator.
  - Calculate accuracy.
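As a rough illustration only, the steps above might look something like the following in Python. This is a sketch under several assumptions, not the benchmark code itself: the function names (`tokenise`, `select_features`, `classify`, `accuracy`) are hypothetical, XML is ignored by crudely stripping anything between `<` and `>` (applied to test files as well as training files), texts are passed in as strings rather than read from files, and the gold labels are supplied as a list because the filename format is not described here.

```python
import random
import re
from collections import Counter

# Assumption: "ignoring XML" is approximated by deleting anything of the form <...>.
TAG_RE = re.compile(r"<[^>]*>")


def tokenise(text):
    """Strip XML-like tags, then split on whitespace."""
    return TAG_RE.sub(" ", text).split()


def top_n_words(texts, n):
    """Return the set of the n most frequent tokens across all the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenise(text))
    return {word for word, _ in counts.most_common(n)}


def select_features(young_texts, old_texts, n):
    """Return (YoungFeatures, OldFeatures): the set differences of the two top-n sets."""
    top_young = top_n_words(young_texts, n)
    top_old = top_n_words(old_texts, n)
    return top_young - top_old, top_old - top_young


def classify(text, young_features, old_features):
    """Label one text 'Young' or 'Old' by comparing feature counts; ties are random."""
    tokens = tokenise(text)
    young_count = sum(1 for t in tokens if t in young_features)
    old_count = sum(1 for t in tokens if t in old_features)
    if young_count > old_count:
        return "Young"
    if old_count > young_count:
        return "Old"
    return random.choice(["Young", "Old"])


def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```

A per-directory driver would read each test file, call `classify` on its contents, derive the gold label from the filename, and pass both lists to `accuracy`. The value of N is left as a parameter since no particular value is given above.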
Mark Dras or