COMP348 Document Processing and the Semantic Web
Assignment 1, Part 1
Latest version: 13 March 2008

Background

A task that has interested many organisations around the world is to determine the characteristics of the writers of particular texts. Is the writer young or old? Male or female? Two examples of such organisations:
In terms of current work, a lot of demographic data in industry comes from the information you fill in when you subscribe to a site. Here's a fairly typical privacy policy (from WiredStart):

"Our site's registration form requires users to give us contact information (like their name, email, and postal address), and demographic information (like their zip code, age, or income level). Contact information from the registration forms is used to get in touch with the customer when necessary. Users may opt-out of receiving future mailings; see the choice/opt-out section below. Demographic and profile data is also collected at our site. This information is shared with advertisers on an aggregate basis. We use this data to tailor our visitor's experience at our site showing them content that we think they might be interested in, and displaying the content according to their preferences."

There's also some other work on automatically trying to identify demographic characteristics. One example is described in the following paper:

Dominique Estival, Tanja Gaustad, Ben Hutchinson, Son Bao Pham and Will Radford. 2007. Author Profiling for English Emails. Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007), 263-272. Melbourne, Australia.

One particular instance of this type of problem would be, given a (set of) blog post(s), to identify automatically the age group of the poster. A later part of Assignment 1 will be to build such a system for yourself. Part 1 will just assume a hypothetical system, ask you to analyse its results in the way you'll be expected to for your own system, and write a short report about it.

Task

Imagine the existence of some hypothetical system for classifying the age of the writer of a piece of text, into one of two classes, Young (13-17) or Old (33-47). We'll call the system AdAge. You have a number of blog posts for which you know the actual age category.
You're given data which represents AdAge's attempted classification of the blog posts. The task is:
Data

There will be two files, one representing the actual class, and one representing the AdAge classification. Both will contain a number of values, either +1 or -1, one per line; +1 indicates that the blog poster is Young, -1 Old. So the file will look something like:

+1
-1
-1
+1
-1
...

The lines in each file will correspond: the first line in both files
will correspond to the classification
of blog post #1, the second line to blog post #2, and so on.
(You'll discover the reason for this format in a later part of the assignment.)
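
The file format above can be read and compared with a few lines of Python. What follows is only a minimal sketch, not a prescribed solution: the function names `read_labels`, `confusion_counts` and `accuracy` are made up for illustration, and treating +1 (Young) as the "positive" class is an assumption.

```python
from pathlib import Path


def read_labels(path):
    """Read one +1/-1 label per line, skipping blank lines."""
    labels = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        value = int(line)  # int("+1") is 1, int("-1") is -1
        if value not in (1, -1):
            raise ValueError(f"unexpected label {line!r} in {path}")
        labels.append(value)
    return labels


def confusion_counts(actual, predicted):
    """Count the four outcomes, treating +1 (Young) as the positive class."""
    if len(actual) != len(predicted):
        raise ValueError("the two files must have the same number of lines")
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1      # Young, classified Young
        elif a == -1 and p == -1:
            counts["tn"] += 1      # Old, classified Old
        elif a == -1 and p == 1:
            counts["fp"] += 1      # Old, classified Young
        else:
            counts["fn"] += 1      # Young, classified Old
    return counts


def accuracy(actual, predicted):
    """Fraction of posts where the system agrees with the actual class."""
    if not actual:
        raise ValueError("no labels to compare")
    c = confusion_counts(actual, predicted)
    return (c["tp"] + c["tn"]) / len(actual)
```

With your own files you would call `read_labels` once on the file of actual classes and once on the file of AdAge classifications, then pass the two lists to `accuracy`. Because line N of each file refers to the same blog post, pairing the lists with `zip` is all the alignment that is needed.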
You will have your individual data files made available to you via Blackboard.
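
Part of the analysis is deciding whether the system's accuracy is statistically significant. This page does not say which test to use, so the sketch below is one possibility only: an exact two-sided binomial test of the number of correct classifications against a null accuracy `p0` (0.5, i.e. coin-flipping, is an assumed default). The function names are hypothetical. The probabilities are computed in log space so that files with thousands of lines do not overflow.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p0):
    """P(exactly k successes in n trials), computed via logs for stability."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p0) + (n - k) * log(1 - p0))


def binomial_two_sided_p(correct, total, p0=0.5):
    """Exact two-sided binomial test.

    Returns the probability, if the true accuracy were p0, of seeing an
    outcome no more likely than the observed number of correct answers.
    """
    if not 0 < p0 < 1:
        raise ValueError("p0 must be strictly between 0 and 1")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    observed = binomial_pmf(correct, total, p0)
    # Small relative tolerance so outcomes tied with the observed one count.
    threshold = observed * (1 + 1e-7)
    tail = sum(
        p
        for p in (binomial_pmf(k, total, p0) for k in range(total + 1))
        if p <= threshold
    )
    return min(1.0, tail)


if __name__ == "__main__":
    # Hypothetical figures, not taken from any real data file.
    print(binomial_two_sided_p(correct=15, total=20))
```

A small p-value suggests the accuracy is unlikely to have arisen from a classifier with accuracy `p0`. If your classes are unbalanced, a majority-class baseline may be a more sensible choice of `p0` than 0.5; that choice is yours to justify in the report.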
The two files should be called

Here are two sample files: model-12345675.txt, representing the actual class of the data; and system-12345675.txt, representing the AdAge classification. Do not use these sample files in your assignment; use the ones available to you in Blackboard. The sample files only have 20 entries; your own files will have a lot more.

Deliverables

You will submit a hardcopy and an electronic copy of a one-page report consisting of the following sections:
Your paper can broadly follow this sample report. You will also submit Python code for processing the data: reading it in, calculating the accuracy rates, and determining statistical significance.

Note that the hardcopy can be submitted up until Tuesday 25 March at noon and still count as being on time. (I don't expect you to physically deliver it on Easter Day.) Electronic submission should still be datestamped before Sunday 11.59pm to count as on time.

Assessment

This part of Assignment 1 is worth 5%. The marking is broken down as follows:
Mark Dras or Diego Molla
