
Department of Computing

COMP348 Document Processing and the Semantic Web

Tutorial Week 3

Tokenisation and Sentence Segmentation (I)

Here is a solution to last week's practical exercise on counting tokens:

import sys

def count_tokens(file, counts):
    """Count the tokens occurring in the file, which contains
    one token per line, and add them to the existing counts in the
    counts dictionary.
    Returns: a dictionary containing the tokens and their counts."""

    # the with statement closes the file even if an error occurs
    with open(file) as infile:
        for line in infile:
            word = line.rstrip().lower()
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 1

    return counts

def report_tokens(counts):
    """Display a report of the tokens in a frequency count."""

    # use a lambda form to define the frequency ordering (most frequent first)
    keys = sorted(counts, key=lambda word: counts[word], reverse=True)

    # print out just the top 20 words (fewer if there are fewer distinct tokens)
    for key in keys[:20]:
        print(key + ":\t" + str(counts[key]))

if __name__ == '__main__':
    counts = dict()
    for arg in sys.argv[1:]:
        print(arg)
        count_tokens(arg, counts)
    report_tokens(counts)

  1. The output contains a lot of words that are probably uninteresting, such as "the" and "of"; these are often called stopwords. How might you get rid of these?
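
As one possible starting point for the discussion, the sketch below filters a counts dictionary against a set of stopwords before reporting. The name remove_stopwords and the contents of STOP_WORDS are assumptions made for illustration; they are not part of the solution above.

```python
# Hypothetical stopword list, chosen only for illustration; a real list
# would be longer and picked by inspecting the frequency report.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}

def remove_stopwords(counts, stop_words=STOP_WORDS):
    """Return a copy of counts without the entries listed in stop_words."""
    return {word: n for word, n in counts.items() if word not in stop_words}

# Made-up counts, used only to show the call
counts = {"the": 110, "of": 95, "evaluation": 20, "model": 16}
print(remove_stopwords(counts))
```

Filtering after counting keeps the counting function unchanged; skipping stopwords while reading the file would work equally well.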

Evaluation Measures

  1. You have a system that classifies emails into one of two classes, work or other. A table giving the classifications is as follows:

                 actual
    system     work  other
    work         20     60
    other         5     15

    What is the accuracy of the classifier? Discuss also what precision and recall might mean here.

  2. Assume you have a three-way classifier, into work, study or other:

                 actual
    system     work  study  other
    work         15     20     30
    study         5      0      5
    other         5      8     12

    What would be a reasonable way to extend the notion of accuracy?
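
The two tables above can be explored with a short sketch. It assumes that rows are the system's decisions and columns are the actual classes, and it treats accuracy as the share of items on the diagonal, which is one reasonable extension to more than two classes. The function names are illustrative only.

```python
def accuracy(table):
    """Share of items whose system label equals the actual label.
    table[system_label][actual_label] holds a count."""
    correct = sum(table[label][label] for label in table)
    total = sum(sum(row.values()) for row in table.values())
    return correct / total

def precision(table, label):
    """Of the items the system labelled `label`, the share that really are."""
    return table[label][label] / sum(table[label].values())

def recall(table, label):
    """Of the items that really are `label`, the share the system found."""
    return table[label][label] / sum(row[label] for row in table.values())

# Assumption: outer keys are system labels, inner keys are actual labels
two_way = {"work":  {"work": 20, "other": 60},
           "other": {"work": 5,  "other": 15}}
three_way = {"work":  {"work": 15, "study": 20, "other": 30},
             "study": {"work": 5,  "study": 0,  "other": 5},
             "other": {"work": 5,  "study": 8,  "other": 12}}

print("two-way accuracy:", accuracy(two_way))
print("precision (work):", precision(two_way, "work"))
print("recall (work):", recall(two_way, "work"))
print("three-way accuracy:", accuracy(three_way))
```

Under these assumptions the two-way classifier gets 35 of 100 emails right, and the three-way classifier 27 of 100.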

Conditional Probability

Bayes' Rule

From the table below, what is the probability that someone is blonde given that they have brown eyes?

                    Hair
Eyes     Black  Brunette  Red  Blonde
Brown       68        20   15       5
Blue       119        84   54      29
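
A small sketch of the calculation, reading the counts straight from the table. The conditional probability is computed directly from the brown-eyes row and then again via Bayes' rule as a cross-check; the function names are assumptions made for this example.

```python
# Counts from the table: table[eyes][hair]
table = {"brown": {"black": 68,  "brunette": 20, "red": 15, "blonde": 5},
         "blue":  {"black": 119, "brunette": 84, "red": 54, "blonde": 29}}

def conditional(table, hair, eyes):
    """P(hair | eyes): share of the eye-colour row with that hair colour."""
    return table[eyes][hair] / sum(table[eyes].values())

def bayes(table, hair, eyes):
    """The same quantity as P(eyes | hair) * P(hair) / P(eyes)."""
    total = sum(sum(row.values()) for row in table.values())
    hair_total = sum(row[hair] for row in table.values())
    p_eyes_given_hair = table[eyes][hair] / hair_total
    p_hair = hair_total / total
    p_eyes = sum(table[eyes].values()) / total
    return p_eyes_given_hair * p_hair / p_eyes

print(conditional(table, "blonde", "brown"))  # 5 out of 108 brown-eyed people
print(bayes(table, "blonde", "brown"))        # agrees up to rounding
```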

Collocations

A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. Examples of collocations are "strong tea" (as opposed to "powerful tea", even though "strong" and "powerful" are interchangeable in other contexts), "World Wide Web", and "red wine". Many specialised terms are in fact collocations, and therefore specialised terminology can be detected by comparing the relative frequency of a sequence of words across several domains.

  1. Using the program count_tokens.py as a base, discuss how you would modify it to write a Python script count_pairs.py (assuming similar data of one word per line) that computes the 20 most frequent pairs of tokens occurring in the text:

    % count_pairs 9405001.sent 9502005.sent
    of the: 110
    in the: 38
    of a: 23
    to the: 20
    evaluation of: 20
    the evaluation: 19
    the head: 19
    for the: 18
    can be: 16
    in a: 16
    similarity model: 16
    ...
    
  2. In the example above the only likely collocation is "similarity model", which appears rather far down the list. This is because of the high frequency of function words such as "the", "a" and "in". Discuss how you would modify your script so that it prints the 20 most frequent pairs of words that do not contain any of the words in a list stored in stop_words. Decide which stop words to include in your list. For example, have a look at the list of most frequent words (use the program above), and decide which of those words should be included.

  3. Discuss how you would write a program bigram_cond_prob.py that calculates the bigram conditional probability of a word given the immediately preceding word. For each word in the text, print out its most likely predecessor. (For this, you should assume an empty stopword list.)

    What do you think will be likely values generated by bigram_cond_prob.py for the tokenised files used in the practical classes last week?
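
The three questions above could be approached along the following lines. This is a sketch under stated assumptions, not a model answer: the function names are made up, tokens are assumed to be already read into a list, and "most likely predecessor" is taken to mean the preceding word p that maximises P(word | p) = count(p, word) / count(p).

```python
from collections import Counter

def count_pairs(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

def top_pairs(pair_counts, stop_words=(), n=20):
    """The n most frequent pairs in which neither word is a stop word."""
    kept = Counter({pair: c for pair, c in pair_counts.items()
                    if pair[0] not in stop_words and pair[1] not in stop_words})
    return kept.most_common(n)

def most_likely_predecessors(tokens):
    """Map each word to (predecessor, P(word | predecessor)) for the
    predecessor with the highest conditional probability."""
    pair_counts = count_pairs(tokens)
    # how often each word occurs as the first element of a pair
    first_counts = Counter(tokens[:-1])
    best = {}
    for (prev, word), c in pair_counts.items():
        prob = c / first_counts[prev]
        if word not in best or prob > best[word][1]:
            best[word] = (prev, prob)
    return best

# Made-up token sequence, used only to show the calls
tokens = "the cat sat on the mat the cat ran".split()
print(top_pairs(count_pairs(tokens), n=3))
print(top_pairs(count_pairs(tokens), stop_words={"the", "on"}))
print(most_likely_predecessors(tokens)["cat"])
```

Note that any predecessor that occurs only once in the text gets probability 1.0 for the word that follows it, which is worth keeping in mind when predicting the values for real files.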

Tokenisation and Sentence Segmentation (II)

Consider the following text:

It is 0.025-in. long. A. lives in the U.S. John 
Mackenzie Jr. lives in Dallas, Tex. This is a 
fact. At 3.p.m. Continental finalized its 
offer. Complaints should be sent to 
Dr. White. He stopped at Meadows Dr. White 
Falcon was still open. This happened at 3 p.m. Did Conti-
nental finalize its offer?  "There is such a quantity 
of unknown and instructive documents" -- H. A.
Taine, August 1875. The cost is $95.40 per average field 
trip; John, pay attention! How is infection transmitted?    
It is not  transmitted from: giving blood/mosquito 
bites/toilet seats/kissing/from normal day-to-day contact.
            
  1. Highlight all the tokens that may be difficult to identify automatically.
  2. Highlight all the sentence delimiters that may be difficult to identify.
  3. In groups of three or four, discuss your highlighted data, why they are problematic, and what is required to automatically handle them.
  4. Come up with algorithms to solve those cases. These algorithms need not be very detailed.
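
To seed the group discussion, here is a deliberately simple rule-based sketch. It rejoins words hyphenated across a line break, then treats "?", "!" and "." as sentence delimiters unless the token is a known abbreviation or a single initial, or the next token does not start with a capital letter or a quotation mark. The abbreviation list and all function names are assumptions for illustration.

```python
import re

# Hypothetical abbreviation list; a real one would be much longer
ABBREVIATIONS = {"dr.", "jr.", "tex.", "u.s.", "p.m.", "in."}

def dehyphenate(text):
    """Rejoin a word split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def is_boundary(token, next_token):
    """Guess whether token ends a sentence."""
    if token.endswith(("?", "!")):
        return True
    if not token.endswith("."):
        return False
    if next_token is None:
        return True
    if token.lower() in ABBREVIATIONS:
        return False
    if len(token) == 2 and token[0].isupper():  # an initial such as "A."
        return False
    return next_token[0].isupper() or next_token[0] == '"'

def split_sentences(text):
    """Split whitespace-separated text into sentences using is_boundary."""
    tokens = dehyphenate(text).split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_boundary(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

for sentence in split_sentences(
        "Complaints should be sent to Dr. White. Did Conti-\nnental finalize its offer?"):
    print(sentence)
```

The sketch fails on several cases in the text above, which is the point of the exercise: it never splits after "Dr.", so "He stopped at Meadows Dr. White Falcon was still open." stays one sentence, and it would wrongly rejoin a genuinely hyphenated word such as "day-to-day" if the line broke at one of its hyphens.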

 

Morphology

Dutch noun morphology has some similarities with English.

  1. There are some "exceptions" which are nevertheless fairly regular:

    singular      plural
    het museum    de musea
    het visum     de visa

  2. The "default" plural is -en. When the vowel before a single final consonant is short (written with one vowel letter), the consonant is doubled:

    singular      plural
    de rok        de rokken
    het geval     de gevallen

  3. When the vowel before a single final consonant is long (written with two vowel letters), the consonant remains single and the vowel is written with one letter instead of two:

    singular      plural
    de peer       de peren
    het gevaar    de gevaren

  4. In some particular cases, -s is used as the plural. One such case is titles ending in -oor:

    singular      plural
    de majoor     de majoors
    de pastoor    de pastoors

How would you write rules for these for the Porter Stemmer, using the notation of the lecture notes?
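
The lecture-note notation is not reproduced here, but the idea of ordered suffix-rewrite rules can be sketched in Python with regular expressions. The rules below are assumptions written only to cover the four patterns above; the first rule that matches is applied, and rules such as "-a becomes -um" would clearly over-apply on a real vocabulary.

```python
import re

CONSONANT = "[b-df-hj-np-tv-z]"

# Ordered (pattern, replacement) rules; hypothetical, for illustration only
RULES = [
    # -oors -> -oor          (majoors -> majoor)
    (re.compile(r"oors$"), "oor"),
    # doubled consonant + en -> single consonant   (rokken -> rok)
    (re.compile(r"(" + CONSONANT + r")\1en$"), r"\1"),
    # single vowel + consonant + en -> doubled vowel + consonant (peren -> peer)
    (re.compile(r"(?<![aeiou])([aeou])(" + CONSONANT + r")en$"), r"\1\1\2"),
    # -a -> -um              (musea -> museum)
    (re.compile(r"a$"), "um"),
]

def singularise(plural):
    """Apply the first matching rule; return the word unchanged if none match."""
    for pattern, replacement in RULES:
        result, n = pattern.subn(replacement, plural)
        if n:
            return result
    return plural

for word in ["musea", "visa", "rokken", "gevallen",
             "peren", "gevaren", "majoors", "pastoors"]:
    print(word, "->", singularise(word))
```

Rule order matters: the doubled-consonant rule has to be tried before the vowel-doubling rule, just as the order of rules within a Porter Stemmer step determines which one fires.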

(Source of information about Dutch grammar of plural nouns: http://www.dutchgrammar.com.)


Comments to: Mark Dras or Diego Molla

Copyright Macquarie University
CRICOS provider no. 00002J