
Department of Computing


COMP348 Document Processing and the Semantic Web

Practical, Week 5

Regular Expressions in Classification

Given the Young/Old classification problem in this week's tutorial exercises, write a Python program that will calculate the counts for the Capitals and BlogWords features. The five-sentence data set is in sentences.txt:

Y1: i hope this wasn't for real ... its pathetic lol ... i bet 12 percent of the world would wanna smack u cats after they heard this garbage!!
Y2: omg you are so funny!!! I love ur video!!! ur the best!!!
O3: I refer to your email dated Wednesday 27 February, subject heading "Water Conservation -- 2008 plan".
Y4: That was the Best Video Ever!! Ken Lee Tulibu dibu douchoo Ken Lee ROFLMAO
O5: Dear Sir, I am writing to you about a Summer Internship.  I am a postgraduate student at the IIT Kanpur, enrolled in a Bachelor of Engineering.

(Sources of data: Y1 and Y2, edited comments from the YouTube video Ashkon: "Hot Tubbin'" -- OFFICIAL CUT; Y4, edited comments from the YouTube video Ken Lee - Bulgarian Idol (WITH ENGLISH TRANSLATION).)

Have the program print out the feature values as follows (where cn and bwn are the Capitals and BlogWords feature values, respectively, for sentence n):

Y1 (0, 5)
Y2 (c2, bw2)
O3 (c3, bw3)
Y4 (c4, bw4)
O5 (c5, bw5)
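
As a starting point, the sketch below shows one way the regular-expression machinery could be arranged. The feature definitions in it are assumptions, not the ones from the tutorial exercises: CAPITAL_WORD counts words that begin with an upper-case letter, and BLOG_WORD uses a small placeholder word list. Replace both patterns with the tutorial's definitions; with the placeholders, the counts will not necessarily match the expected values above. The names features and print_features are hypothetical.

```python
import re

# ASSUMED feature definitions (placeholders, not the tutorial's):
# a "capital" is any word starting with an upper-case ASCII letter,
# a "blog word" is any word from a small hand-picked list.
CAPITAL_WORD = re.compile(r"\b[A-Z][A-Za-z]*\b")
BLOG_WORD = re.compile(r"\b(?:lol|omg|ur|u|wanna)\b", re.IGNORECASE)

# A data line looks like "Y1: some text" or "O3: some text".
LINE = re.compile(r"^([YO]\d+):\s*(.*)$")


def features(line):
    """Return (label, (capitals, blog_words)) for one data line, or None."""
    m = LINE.match(line.strip())
    if m is None:
        return None  # blank line or a line without a Y/O label
    label, text = m.groups()
    capitals = len(CAPITAL_WORD.findall(text))
    blog_words = len(BLOG_WORD.findall(text))
    return label, (capitals, blog_words)


def print_features(lines):
    """Print one 'label (c, bw)' row per labelled line."""
    for line in lines:
        result = features(line)
        if result is not None:
            label, values = result
            print(label, values)
```

To run it on the data set, open the file and pass the file object to the function, for example: with open("sentences.txt") as f: print_features(f).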

A Small Classification System

The aim of this exercise is to build a system that decides whether a given line of text in a file is in English or Dutch, using a simple given algorithm, and then to evaluate its accuracy for a range of parameter values.

The data consists of randomly interleaved lines from English and Dutch versions of Little Women. Each line is annotated at the start with "E" or "D", depending on the language, followed by the line number and a colon (for example "E3: "). The training data to use to build your system is in train.txt:

D1: Jo werd het eerst wakker op den grauwen, schemerachtigen
D2: Kerstmorgen. Er hingen geen kousen bij den haard, en gedurende een paar
E3: Jo was the first to wake in the gray dawn of Christmas morning.
E4: No stockings hung at the fireplace, and for a moment she
E5: felt as much disappointed as she did long ago, when her little
D6: minuten voelde ze zich even teleurgesteld, als toen, jaren geleden,
D7: haar kleine kous op den grond viel, omdat die zoo volgestopt was met
E8: sock fell down because it was crammed so full of goodies.  Then
        

An obvious approach would be to look up words in a dictionary. In the absence of a dictionary, we'll try something else. Different languages have different frequencies of letter combinations: for example, aar is much more common in Dutch than in English. What your program should do, then, is count letter triples (sequences of three consecutive letters) in each language, and use those counts to guess whether each line in the test file test.txt is in English or Dutch.
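
The idea of counting letter triples can be sketched as follows. This is a minimal illustration under some assumptions that the exercise does not fix: text is lower-cased, only the ASCII letters a-z are considered (so accented letters split a word), and triples do not span word boundaries. The name letter_triples is hypothetical.

```python
import re
from collections import Counter


def letter_triples(text):
    """Yield each run of three consecutive letters inside every word of text.

    Assumptions: case is ignored, only a-z count as letters, and a triple
    never crosses a word boundary.
    """
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - 2):
            yield word[i:i + 3]


# Count the triples in a short Dutch fragment from the training data.
counts = Counter(letter_triples("haar kleine kous"))
```

A Counter is convenient here because its most_common(n) method returns the n most frequent triples, which is what the guessing step needs later.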

Your program should have the following components:

  1. A function build_tri_letter("train.txt", "model.dat"), which reads the training data from train.txt, counts the letter triples for each language, and stores the results in the file model.dat as follows:

    D een 167
    D aar 163
    D den 122
    D gen 103
    D der 101
    D het 98
    ...
    D ina 1
    D eop 1
    E the 321
    E and 227
    E ing 134
    E her 109
    E for 67
    E ent 58
    ...
            
    (This ability to save the results of training to a file is useful so that they don't need to be constantly recalculated for different test data.)

  2. A function read_model("model.dat") that can read the data back from model.dat into a dictionary (or other appropriate data structure).

  3. A function guess_lang(N, "test.txt", "test-guess.txt"), whose first parameter N is the number of top triples from each language to use in guessing the language of each line in the file given by its second parameter, test.txt. The guess should be a simple majority vote: if the line has more triples among the top N English triples than among the top N Dutch triples, it should be marked as English, and vice versa. Choose one of English or Dutch as the default in case of a tie. The guess should then be written to the file test-guess.txt as a single letter "D" or "E" in square brackets, followed by the original line, in the following format:

    ...
    [E] E27: now what shall we wear?"
    [E] D28: niet aan haar stoorde. Toen Meta kwam, trok Knabbelaar zich in haar
    [D] D30: hol terug. Jo veegde haar tranen af en wachtte geduldig op het nieuws.
    [D] D32: "Verbeeld je eens, hoe heerlijk, een invitatiekaart van mevrouw
    [D] E33: "What's the use of asking that, when you know we shall wear
    ...
            
    (This allows you to inspect the individual results of the algorithm.)

  4. A function calc_acc("test-guess.txt"), which calculates the overall accuracy of the algorithm, that is, the proportion of lines whose guessed language matches the annotated language.
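
The core logic behind components 2 to 4 can be sketched on in-memory lists of lines, leaving the file handling to you. This is one possible arrangement, not a reference solution. All helper names (parse_model, top_triples, guess_line, accuracy) are hypothetical, and two details are assumptions: every occurrence of a triple in a line counts as one vote, and the "E27: " style annotation is stripped before voting so that it cannot influence the guess.

```python
import re
from collections import Counter


def letter_triples(text):
    """Yield letter triples within words (assumes lower-cased a-z only)."""
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - 2):
            yield word[i:i + 3]


def parse_model(lines):
    """Parse 'LANG triple count' lines into a dict {lang: Counter}."""
    model = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        lang, triple, count = parts
        model.setdefault(lang, Counter())[triple] = int(count)
    return model


def top_triples(model, lang, n):
    """Return the set of the n most frequent triples for one language."""
    return {t for t, _ in model.get(lang, Counter()).most_common(n)}


def guess_line(line, top_e, top_d, default="E"):
    """Majority vote between English and Dutch; ties go to default."""
    text = re.sub(r"^[DE]\d+:\s*", "", line)  # drop the annotation
    triples = list(letter_triples(text))
    e_votes = sum(t in top_e for t in triples)
    d_votes = sum(t in top_d for t in triples)
    if e_votes == d_votes:
        return default
    return "E" if e_votes > d_votes else "D"


def accuracy(guess_lines):
    """Fraction of '[G] Lnn: ...' lines where guess G equals label L."""
    total = correct = 0
    for line in guess_lines:
        m = re.match(r"\[([DE])\]\s+([DE])\d+:", line)
        if m:
            total += 1
            correct += m.group(1) == m.group(2)
    return correct / total if total else 0.0
```

Applied to the five sample lines of test-guess.txt shown above, accuracy counts three correct guesses out of five. Running guess_lang for several values of N and comparing the resulting accuracies is then a matter of looping over N.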

 


Comments to: Mark Dras or Diego Molla

Computing | Division ICS | Macquarie University

Copyright Macquarie University
CRICOS provider no. 00002J