COMP348 Document Processing and the Semantic Web
Assignment 1, Part 2
Latest version: 10 April 2008

Background
The background is the same as for Assignment 1, Part 1.

Task
Your task is to build a rule-based classifier that assigns each blog in a corpus to one of two classes, Young (13-17) or Old (33-47), based on the age of the blog poster. (Part 3 will use a machine learning approach to constructing a classifier.) For example, your rules might look something like:

IF blog_post uses "lol" THEN category is Young
IF number of occurrences of urlLink in blog_post > 5 THEN category is Old

Your classifier should be written in Python, and should consist of at least the following two files:
Data
The data comes from the Blog Authorship Corpus. The corpus consists of a number of files, with names of the form
Here is a sample of 20 files from each of the Young and Old categories; use this as your initial "training" data (i.e. the data you use to come up with your rules).
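
As a rough illustration only, the two example rules from the Task section could be coded and scored against labelled posts along the following lines. This is a minimal sketch, not the required file layout or the benchmark algorithm: the names `classify` and `accuracy`, the order in which the rules are tried, the fallback category used when no rule fires, and the toy posts in the usage example are all assumptions made for the example.

```python
import re

YOUNG = "Young"
OLD = "Old"


def classify(blog_post):
    """Apply the two example rules in order; the first rule that fires wins.

    The rule order and the fallback category are assumptions made for
    this sketch, not part of the assignment specification.
    """
    # Lower-case the text and split it into word tokens, so that "LOL!"
    # matches the rule but "lollipop" does not.
    tokens = re.findall(r"[a-z']+", blog_post.lower())

    # Rule 1: IF blog_post uses "lol" THEN category is Young
    if "lol" in tokens:
        return YOUNG

    # Rule 2: IF number of occurrences of urlLink in blog_post > 5
    #         THEN category is Old
    if blog_post.count("urlLink") > 5:
        return OLD

    # No rule fired: fall back to a default category (an assumption).
    return OLD


def accuracy(labelled_posts):
    """Return the fraction of (text, label) pairs classified correctly."""
    if not labelled_posts:
        raise ValueError("need at least one labelled post")
    correct = sum(1 for text, label in labelled_posts
                  if classify(text) == label)
    return correct / len(labelled_posts)


if __name__ == "__main__":
    # Invented toy posts, purely for illustration; not taken from the corpus.
    sample = [
        ("lol that was so funny", YOUNG),
        ("urlLink " * 6, OLD),
        ("Went to work today.", OLD),
    ]
    print("accuracy: %.2f%%" % (100 * accuracy(sample)))
```

Keeping each rule as a separate test makes it easy to add, remove or reorder rules while you study the training files, and to see how each change affects the accuracy.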
Here are larger data sets. The

Benchmark
To give you an idea of what accuracy should be achievable, and of a benchmark I'll be using for assigning the "quality of results" part of the assessment, I've implemented a simple algorithm. Evaluated on the test data, its accuracy is 255 / 400 (63.75%).

Deliverables

Report
You will submit a hardcopy and an electronic copy of a report consisting of the following sections:
Your paper can broadly follow this sample report. Note that it should be longer than in Part 1, because you'll have to explain your features and rules. The report should be more polished than the one for Part 1. It should look like an actual report (albeit shorter) of the sort you'll have to write when you go and work:
Here are a couple of nice example reports from Part 1: here and here.

Code
You will also submit Python code as specified above (both electronic and hardcopy versions), as well as any other relevant files. These other files might include other Python programs called by the required ones, or data files (e.g. with a list of features) used by them.

Assessment
This part of Assignment 1 is worth 8%. The marking is broken down as follows:
Mark Dras or Diego Molla
