
Department of Computing


COMP348 Document Processing and the Semantic Web

Practical Exercises, Week 2

Prologue

This first part is for those who are new to Python or want a bit of revision. Otherwise, skip to the Exercises part.

I'll be assuming you're working under Windows.

Running Python

You can run a Python interactive session either from the command line or via the IDE shell.

On Windows, select "PythonWin IDE" from the Start Menu; this opens a window containing the interactive prompt alongside the IDE's other tools.

Python Documentation

You have several locations for Python help documentation:

Trying out Python

To get up to speed on Python, you can work through Dive Into Python (chapters 1-7), the python.org tutorial (all chapters), or both.

Exercises

Class Exercises

Attempt the class exercises, repeated here:

  1. You're given data of the following form:
    namedat = dict()
    namedat['mc'] = ('Madonna', 45)
    namedat['sc'] = ('Steve', 41)
    How would you print out a list ordered by age?
    ('Steve', 41)
    ('Madonna', 45)
    
    Hint: Create a dictionary where the age is the key. But make sure that you can handle people with the same age. For example, suppose that your dictionary also has the following:
    namedat['tr'] = ('Tim', 41)
    
  2. Write a function to carry out selection sort on a list of numbers.
    def selectionSort(numList):
        """ performs selection sort on numList"""
    
        # INSERT CODE
        return numList
    
  3. You want to compare a model solution file (model-X.txt) with a file containing an attempted solution in the same format (result-X.txt). The code should return the number of lines with differences. The data is (supposed to be) of the following form (model.txt):

    Document number 1: word accuracy rate is 35/60.
    Document number 2: word accuracy rate is 4/62.
    Document number 3: word accuracy rate is 1/9.
    

    Your code should handle cases both where the attempted solution is actually in the correct format (result-a.txt) and where there are some minor errors (result-b.txt).

    Fill in the code below

    def compare_correct(modelfilename, resultfilename):
    
        """ extracts correctly classified components from a specified format;
            does a comparison between model and result,
            and returns number of differences """
    
        modelfile = open(modelfilename)
        resultfile = open(resultfilename)
    
        modelline = modelfile.readline()
        resultline = resultfile.readline()
    
        num_diff = 0
        while len(modelline) > 0:
    
            # INSERT CODE HERE
    
            modelline = modelfile.readline()
            resultline = resultfile.readline()
    
        # close both files before returning the result
        modelfile.close()
        resultfile.close()
        return num_diff
    
    
    if __name__ == "__main__":
        # print(...) with a single argument works in both Python 2 and Python 3
        print(compare_correct("model.txt", "result-a.txt"))
        print(compare_correct("model.txt", "result-b.txt"))
        
    
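For the hint in class exercise 1, here is a minimal sketch of one way to key a dictionary on age without losing people who share an age: each age maps to a list of entries. The function names group_by_age and ordered_by_age are made up for this illustration, and sorting the names within one age is an assumption (the exercise does not say how ties should be ordered).

```python
# Sketch only: group people by age so that equal ages are not lost.
namedat = dict()
namedat['mc'] = ('Madonna', 45)
namedat['sc'] = ('Steve', 41)
namedat['tr'] = ('Tim', 41)


def group_by_age(data):
    """Return a dictionary mapping each age to a list of (name, age) tuples."""
    by_age = {}
    for person in data.values():
        age = person[1]
        # setdefault creates an empty list the first time an age is seen
        by_age.setdefault(age, []).append(person)
    return by_age


def ordered_by_age(data):
    """Return the (name, age) tuples in ascending order of age."""
    by_age = group_by_age(data)
    result = []
    for age in sorted(by_age):
        # assumption: people of the same age are listed alphabetically
        result.extend(sorted(by_age[age]))
    return result


for person in ordered_by_age(namedat):
    print(person)
```

With the three entries above, Steve and Tim (both 41) are printed before Madonna (45).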

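For class exercise 3, one possible approach is to pull the numbers out of each line with a regular expression and compare those, rather than comparing the raw strings. The sketch below assumes that the "minor errors" are differences in spacing or capitalisation; the contents of result-b.txt are not shown here, so you may need to adjust the pattern. The names LINE_PATTERN, parse_line and lines_differ are hypothetical.

```python
import re

# Assumed pattern for lines such as
# "Document number 1: word accuracy rate is 35/60."
LINE_PATTERN = re.compile(
    r"Document number\s+(\d+)\s*:.*?(\d+)\s*/\s*(\d+)", re.IGNORECASE)


def parse_line(line):
    """Return (document number, correct, total) as integers, or None if no match."""
    match = LINE_PATTERN.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def lines_differ(modelline, resultline):
    """True if the lines disagree on the extracted numbers or cannot be parsed."""
    model = parse_line(modelline)
    result = parse_line(resultline)
    return model is None or result is None or model != result
```

Inside the while loop of compare_correct, you could then add 1 to num_diff whenever lines_differ(modelline, resultline) is true.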
Word Frequency

This exercise uses data taken from the Wall Street Journal, a common source of text in Natural Language Processing. To test your programs, use the set of files stored in this directory. They were produced by tokenising the original texts and writing one token per line.

  1. Write a Python script count_tokens that prints the frequency of all the tokens in a list of files. Make sure that all the tokens are first converted into lowercase:

    % count_tokens 9405001.sent 9502005.sent
    labeled: 1
    up: 2
    head: 24
    pattern: 1
    necessarily: 1
    passive: 2
    us: 3
    observe: 1
    presentation: 1
    free: 2
    ...
    

    Hints:

    1. Use the lower() method of a string (for example token.lower()), or the string.lower() function from the string module in older versions of Python, to turn a string into lowercase.
    2. Use a dictionary to count the word frequencies. For example, the value frequency['the'] stores the frequency of the word 'the', and so on.
  2. Now, extend your program so that it prints out the 20 most frequent tokens, in descending order of frequency:

    % count_tokens 9405001.sent 9502005.sent
    the: 561
    of: 298
    to: 169
    in: 159
    a: 152
    is: 132
    for: 125
    and: 103
    ...
    

    To do this, you may want to define a function by_key that, given the arguments a and b, returns the numerical comparison between frequency[a] and frequency[b]. You can then use by_key as the sorting criterion when sorting the dictionary keys.

  3. Extend the previous code so that each count is also printed, in parentheses, as a percentage of the total number of tokens:

    % count_tokens 9405001.sent 9502005.sent
    the:	561	(8.75%)
    of:	298	(4.65%)
    to:	169	(2.64%)
    in:	159	(2.48%)
    a:	152	(2.37%)
    is:	132	(2.06%)
    for:	125	(1.95%)
    and:	103	(1.61%)
    ...
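
    The counting and sorting steps in parts 1 and 2 can be sketched as below. This is only an illustration on a small in-memory list, not on the .sent files. Note that it uses a key function instead of the two-argument comparison by_key suggested above, because sorted() in Python 3 no longer accepts a comparison function directly. The names count_tokens and most_frequent, and the alphabetical tie-break, are assumptions made for the example.

```python
def count_tokens(lines):
    """Count lowercased tokens, given an iterable of one-token-per-line strings."""
    frequency = {}
    for line in lines:
        token = line.strip().lower()
        if token:  # skip empty lines
            frequency[token] = frequency.get(token, 0) + 1
    return frequency


def most_frequent(frequency, n=20):
    """Return the n most frequent (token, count) pairs, highest count first."""
    # Sort by descending count; ties are broken alphabetically by token.
    ranked = sorted(frequency.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


sample = ["The", "of", "the", "to", "THE", "of"]
for token, count in most_frequent(count_tokens(sample)):
    print("%s: %d" % (token, count))
```

    For the real exercise you would pass the lines of each file named on the command line (sys.argv[1:]) to the counting step.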
    

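For part 3, the percentage is the token's count divided by the sum of all counts (not just the counts of the 20 tokens printed), multiplied by 100. A minimal sketch of the formatting, using a small made-up frequency dictionary and a hypothetical helper format_with_percentage:

```python
def format_with_percentage(token, count, total):
    """Format one output line: token, count, and count as a percentage of total."""
    percentage = 100.0 * count / total  # 100.0 forces floating-point division
    return "%s:\t%d\t(%.2f%%)" % (token, count, percentage)


frequency = {"the": 3, "of": 2, "to": 1}  # made-up counts for illustration
total = sum(frequency.values())
for token in sorted(frequency, key=frequency.get, reverse=True):
    print(format_with_percentage(token, frequency[token], total))
```

In the format string, %.2f rounds to two decimal places and %% prints a literal percent sign.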
Some notes:

  • Working under Windows in the labs, you can map directly to the directory that contains the files. You should already have a drive (G:) that links to \\claudius\units; if not, use "My Computer" to map the drive. From there, choose comp348\html\resources.

  • With the Windows version, you'll probably want to write the results of your programs to a file.
  • Alternatively, again working under Windows, you can mimic the Unix-style command-line invocations above by creating a small Python file that calls the Python programs and directs the output to a file. For example, create a file containing just this:

    import os
    os.system("count_tokens.py 9405001.sent 9502005.sent > test1.out")
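
    If the os.system call above does not work on your machine (for instance because .py files are not associated with the Python interpreter), a variant using the standard subprocess module may help. This is a sketch under that assumption; run_to_file is a hypothetical helper name.

```python
import subprocess
import sys


def run_to_file(script_args, out_path):
    """Run the current Python interpreter with script_args, sending stdout to out_path."""
    with open(out_path, "w") as out_file:
        # sys.executable is the path of the interpreter running this script
        return subprocess.call([sys.executable] + list(script_args), stdout=out_file)


# Example call, equivalent to the os.system line above:
# run_to_file(["count_tokens.py", "9405001.sent", "9502005.sent"], "test1.out")
```

    The function returns the exit code of the program it ran, so a result of 0 normally means the script finished without error.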
    



Comments to: Mark Dras or Diego Molla

Copyright Macquarie University
CRICOS provider no. 00002J