COMP348 Document Processing and the Semantic Web
Practical, Week 5

Regular Expressions in Classification

Given the Young/Old classification problem in this week's tutorial exercises, write a Python program that calculates the counts for the Capitals and BlogWords features. The five-sentence data set is in sentences.txt:

Y1: i hope this wasn't for real ... its pathetic lol ... i bet 12 percent of the world would wanna smack u cats after they heard this garbage!!
Y2: omg you are so funny!!! I love ur video!!! ur the best!!!
O3: I refer to your email dated Wednesday 27 February, subject heading "Water Conservation -- 2008 plan".
Y4: That was the Best Video Ever!! Ken Lee Tulibu dibu douchoo Ken Lee ROFLMAO
O5: Dear Sir, I am writing to you about a Summer Internship. I am a postgraduate student at the IIT Kanpur, enrolled in a Bachelor of Engineering.

(Sources of data: Y1, Y2, edited comments from YouTube Ashkon: "Hot Tubbin'" -- OFFICIAL CUT; Y4, edited comments from YouTube Ken Lee - Bulgarian Idol (WITH ENGLISH TRANSLATION).)

Have the program print out the feature values as follows (where cn and bwn are the Capitals and BlogWords feature values for sentence n):

Y1 (0, 5)
Y2 (c2, bw2)
O3 (c3, bw3)
Y4 (c4, bw4)
O5 (c5, bw5)

A Small Classification System

The aim of this exercise is to build a system that decides whether a particular line of text in a file is in English or Dutch, using a given simple algorithm, and then to evaluate its accuracy for a range of parameters. The data consists of randomly interleaved lines from English and Dutch versions of Little Women. Each line is annotated at the start with either "E: " or "D: " depending on the language. The training data to use to build your system is train.txt:
D1: Jo werd het eerst wakker op den grauwen, schemerachtigen
D2: Kerstmorgen. Er hingen geen kousen bij den haard, en gedurende een paar
E3: Jo was the first to wake in the gray dawn of Christmas morning.
E4: No stockings hung at the fireplace, and for a moment she
E5: felt as much disappointed as she did long ago, when her little
D6: minuten voelde ze zich even teleurgesteld, als toen, jaren geleden,
D7: haar kleine kous op den grond viel, omdat die zoo volgestopt was met
E8: sock fell down because it was crammed so full of goodies. Then
An obvious approach would be to look up words in a dictionary. In the absence of a dictionary, we'll try something else. Different languages have different frequencies of letter combinations: for example, aar is much more common in Dutch than in English. What your program should do, then, is to count triples of letter combinations, and use that to guess whether a line in the test file test.txt is in English or Dutch. Your program should have the following components:
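As an aside on the letter-triple idea described above, here is a minimal Python sketch. It is not the required solution and not the component breakdown the practical asks for. The function names (trigrams, train, guess) are hypothetical, and the scoring rule is an assumption made only for illustration: each triple in a test line votes for the language in which it has the higher relative frequency in the training data. The sketch also assumes that the first character of each training line is the language label, that the text follows the first colon, and that only the letters a-z matter.

```python
import re
from collections import Counter


def trigrams(text):
    """Return the overlapping letter triples found inside each word of text."""
    grams = []
    # Assumption: only ASCII letters count; case is ignored.
    for word in re.findall(r"[a-z]+", text.lower()):
        grams.extend(word[i:i + 3] for i in range(len(word) - 2))
    return grams


def train(lines):
    """Build one Counter of triples per language label ('E' or 'D')."""
    counts = {"E": Counter(), "D": Counter()}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label, _, text = line.partition(":")
        # Assumption: the first character of the line is the language label.
        counts[label.strip()[0]].update(trigrams(text))
    return counts


def guess(text, counts):
    """Guess 'E' or 'D' by letting each triple vote for one language."""
    totals = {lang: sum(c.values()) or 1 for lang, c in counts.items()}
    votes = Counter()
    for gram in trigrams(text):
        rel = {lang: counts[lang][gram] / totals[lang] for lang in counts}
        if rel["E"] != rel["D"]:
            votes[max(rel, key=rel.get)] += 1
    # Ties (including lines with no informative triples) fall back to 'E';
    # this tie-break is arbitrary.
    return "E" if votes["E"] >= votes["D"] else "D"
```

A typical use would be to read train.txt into a list of lines, call train on it, and then call guess on the text of each line in test.txt, comparing the result with the line's annotated label to measure accuracy. A simple vote keeps the sketch short; a probabilistic score with smoothing would be a natural alternative.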
Mark Dras or Diego Molla
