Department of Computing

COMP348 Document Processing and the Semantic Web

Assignment 1, Part 1

Latest version: 13 March 2008

Background

A task that has interested many organisations around the world is to determine the characteristics of the writers of particular texts. Is the writer young or old? Male or female? Two examples of such organisations:

  • Marketers are interested in it -- what is this person like, and by extension, what are they interested in buying? Amazon's recommender system lets you know what other people like you bought, where this similarity is determined by previous purchases. Google's AdWords, for placing internet ads, can help you determine the best locations to post them based on demographic data: the AdWords "Placement Tool will ... return a list of sites whose audience tends to match [the] demographic descriptions [you request]".
  • Governments and Departments of Defence might be interested in it -- if they have a lot of information to scan, and want to identify text that is likely to require closer examination, they might use demographic characteristics to do this. (I am not suggesting, of course, that our government might do such a thing.)

In terms of current work, a lot of demographic data in industry comes from the information you fill in when you subscribe to a site. Here's a fairly typical privacy policy (from WiredStart):

Our site's registration form requires users to give us contact information 
(like their name, email, and postal address), and demographic information 
(like their zip code, age, or income level). 

Contact information from the registration forms is used to get in touch 
with the customer when necessary. 

Users may opt-out of receiving future mailings; see the choice/opt-out section below. 

Demographic and profile data is also collected at our site. 

This information is shared with advertisers on an aggregate basis. We use this data 
to tailor our visitor's experience at our site showing them content that we think 
they might be interested in, and displaying the content according to their preferences. 

There is also other work on automatically identifying demographic characteristics. One example is described in the following paper:

Dominique Estival, Tanja Gaustad, Ben Hutchinson, Son Bao Pham and Will Radford. 2007. Author Profiling for English Emails. Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007), 263-272. Melbourne, Australia.

One particular instance of this type of problem would be, given a (set of) blog post(s), to identify automatically the age group of the poster. A later part of Assignment 1 will be to build such a system for yourself. Part 1 will just assume a hypothetical system, ask you to analyse its results in the way you'll be expected to for your own system, and write a short report about it.

Task

Imagine a hypothetical system that classifies the writer of a piece of text by age into one of two classes, Young (13-17) or Old (33-47). We'll call the system AdAge. You have a number of blog posts for which you know the actual age category.

You're given data which represents AdAge's attempted classification of the blog posts. The task is:

  1. to calculate the accuracy rate of AdAge;
  2. to compare this against a most-frequent-category baseline, and determine whether the difference is statistically significant; and
  3. to write a short (one page) report on it.
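The first two steps can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed solution: it assumes Python 3, the function names accuracy and most_frequent_baseline are hypothetical, and the labels are made up for the example (they are not assignment data).

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not actual:
        raise ValueError("label lists must not be empty")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def most_frequent_baseline(actual):
    """Return (label, accuracy) for always predicting the most common actual label."""
    label, count = Counter(actual).most_common(1)[0]
    return label, count / len(actual)


# Made-up labels for illustration only: +1 = Young, -1 = Old.
actual = [+1, -1, -1, +1, -1]
predicted = [+1, -1, +1, +1, -1]

print(accuracy(actual, predicted))     # 0.8
print(most_frequent_baseline(actual))  # (-1, 0.6)
```

The baseline here always predicts whichever class is most common among the actual labels, so its accuracy is simply that class's share of the data.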

Data

There will be two files, one representing the actual class, and one representing the AdAge classification. Both will contain a number of values, either +1 or -1, one per line; +1 indicates that the blog poster is Young, -1 that the poster is Old. So each file will look something like:

+1
-1
-1
+1
-1
...

The lines in the two files correspond: the first line of each file gives the classification of blog post #1, the second line that of blog post #2, and so on. (You'll discover the reason for this format in a later part of the assignment.) Your individual data files will be made available to you via Blackboard. The two files should be called model-NNNNNNNN.txt and system-NNNNNNNN.txt, where NNNNNNNN is your 8-digit student number.
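Reading files in this format might look like the sketch below. It assumes Python 3 and plain-text files with one +1 or -1 per line as described above; the helper names read_labels and read_pairs are hypothetical, and skipping blank lines is an assumption made here for robustness, not something the format requires.

```python
def read_labels(path):
    """Read one label per line; each non-blank line must be +1 or -1."""
    labels = []
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            text = line.strip()
            if not text:
                continue  # assumption: tolerate blank lines such as a trailing newline
            if text not in ("+1", "-1"):
                raise ValueError(f"{path}:{line_number}: unexpected value {text!r}")
            labels.append(int(text))  # int("+1") is 1, int("-1") is -1
    return labels


def read_pairs(model_path, system_path):
    """Pair the actual and system labels line by line."""
    actual = read_labels(model_path)
    predicted = read_labels(system_path)
    if len(actual) != len(predicted):
        raise ValueError("the two files have different numbers of labels")
    return list(zip(actual, predicted))
```

A call such as read_pairs("model-NNNNNNNN.txt", "system-NNNNNNNN.txt"), with your own student number in place of NNNNNNNN, would return a list of (actual, AdAge) label pairs, one per blog post.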

Here are two sample files: model-12345675.txt, representing the actual class of the data; and system-12345675.txt, representing the AdAge classification. Do not use these sample files in your assignment; use the ones available to you in Blackboard. The sample files only have 20 entries; your own files will have a lot more.

Deliverables

You will submit a hardcopy and an electronic copy of a one-page report consisting of the following sections:

  1. the problem (see Background above);
  2. the system (you don't know how AdAge works, but this section will act as a placeholder for a similar report you'll write for the later parts of this assignment);
  3. the results -- the AdAge accuracy rate, the baseline accuracy rate, a note of the statistical test used to determine whether there is a statistically significant difference between these rates, and the result of this test; and
  4. the conclusion.

Your report can broadly follow this sample report.

You will also submit Python code for processing the data: reading it in, calculating the accuracy rates, and determining statistical significance.
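This page does not name a particular statistical test, so the following is only one possible choice, shown as an assumption: an exact McNemar test, which compares two classifiers evaluated on the same items by looking only at the items where exactly one of them is correct. The function name exact_mcnemar_p is hypothetical, and the sketch assumes Python 3.8 or later for math.comb. The test expected in this unit may be a different one.

```python
from math import comb


def exact_mcnemar_p(actual, system, baseline):
    """Two-sided exact McNemar p-value for two classifiers on the same items."""
    triples = list(zip(actual, system, baseline))
    # Discordant items: exactly one of the two classifiers is correct.
    system_only = sum(1 for a, s, b in triples if s == a and b != a)
    baseline_only = sum(1 for a, s, b in triples if b == a and s != a)
    n = system_only + baseline_only
    if n == 0:
        return 1.0  # the two classifiers are never discordant
    k = min(system_only, baseline_only)
    # Under the null hypothesis, each discordant item favours either
    # classifier with probability 0.5, so the count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up labels for illustration only: +1 = Young, -1 = Old.
actual = [+1] * 10
system = [+1] * 10        # system correct on all ten items
baseline = [-1] * 10      # baseline wrong on all ten items
print(exact_mcnemar_p(actual, system, baseline))  # 0.001953125
```

For the most-frequent-category baseline, the baseline list is the majority label repeated once per blog post. A p-value below a chosen threshold (0.05 is a common convention) would indicate a statistically significant difference between the two accuracy rates.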

Note that the hardcopy can be submitted up until Tuesday 25 March at noon and still count as being on time. (I don't expect you to physically deliver it on Easter Day.) Electronic submission should still be datestamped before Sunday 11.59pm to count as on time.

Assessment

This part of Assignment 1 is worth 5%. The marking is broken down as follows:

  • [0.5 marks] determination of baseline;
  • [1.5 marks] correct calculation of AdAge accuracy rate;
  • [1.0 marks] determination of statistical significance;
  • [2.0 marks] quality of report.


Comments to: Mark Dras or Diego Molla

Copyright Macquarie University
CRICOS provider no. 00002J