COMP348 Document Processing and the Semantic Web
Practical Exercises, Week 2
Prologue
This first part is for those who are new to Python or want
a bit of revision.
Otherwise, skip to the
Exercises part.
I'll be assuming you're working under Windows.
Running Python
You can run a Python interactive session either from the
command line or via the IDE shell.
On Windows, select "PythonWin IDE" via the Start Menu
and you'll get the same prompt in a window along with a
bunch of other stuff.
Python Documentation
You have several locations for Python help documentation:
Trying out Python
To get up to speed on Python, you can work through
either (or both) of Dive Into
Python (chapters 1-7) or the python.org
tutorial (all chapters).
Exercises
Class Exercises
Attempt the class exercises, here repeated:
- You're given data of the following form:
namedat = dict()
namedat['mc'] = ('Madonna', 45)
namedat['sc'] = ('Steve', 41)
How would you print out a list ordered by age?
('Steve', 41)
('Madonna', 45)
Hint: Create a dictionary where the age is the key, but make sure that you can handle people with the same age. For example, suppose that your dictionary also has the following:
namedat['tr'] = ('Tim', 41)
- Write a function to carry out selection sort on a list of numbers.
def selectionSort(numList):
    """ performs selection sort on numList"""
    # INSERT CODE
    return numList
- You want to compare a model solution file (model-X.txt) with a file
containing an attempted solution in the same format (result-X.txt).
The code should return the number of lines with differences.
The data is (supposed to be) of the following form (model.txt):
Document number 1: word accuracy rate is 35/60.
Document number 2: word accuracy rate is 4/62.
Document number 3: word accuracy rate is 1/9.
Your code should handle cases both where the attempted solution is actually
in the correct format (result-a.txt) and where
there are some minor errors (result-b.txt).
Fill in the code below:
def compare_correct(modelfilename, resultfilename):
    """ extracts correctly classified components from a specified format;
    does a comparison between model and result,
    and returns number of differences """
    modelfile = open(modelfilename)
    resultfile = open(resultfilename)
    modelline = modelfile.readline()
    resultline = resultfile.readline()
    num_diff = 0
    # readline() returns an empty string at end of file
    while len(modelline) > 0:
        # INSERT CODE HERE
        modelline = modelfile.readline()
        resultline = resultfile.readline()
    modelfile.close()
    resultfile.close()
    return num_diff
if __name__ == "__main__":
    print(compare_correct("model.txt", "result-a.txt"))
    print(compare_correct("model.txt", "result-b.txt"))
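The hint in the first class exercise (key a dictionary by age, while coping with several people of the same age) could be sketched roughly as follows. This is only one possible approach, not a model solution. The helper names group_by_age and records_by_age are made up for illustration, and listing people of the same age alphabetically is an assumption, since the exercise does not say how ties should be ordered.

```python
namedat = dict()
namedat['mc'] = ('Madonna', 45)
namedat['sc'] = ('Steve', 41)
namedat['tr'] = ('Tim', 41)


def group_by_age(data):
    """Map each age to the list of (name, age) records with that age.

    Hypothetical helper: storing a list per age keeps people who
    share an age from overwriting each other.
    """
    by_age = {}
    for record in data.values():
        age = record[1]
        by_age.setdefault(age, []).append(record)
    return by_age


def records_by_age(data):
    """Return all records ordered by age (hypothetical helper)."""
    by_age = group_by_age(data)
    ordered = []
    for age in sorted(by_age):
        # Assumption: records with the same age are listed alphabetically.
        ordered.extend(sorted(by_age[age]))
    return ordered


for record in records_by_age(namedat):
    print(record)
```

With the three entries above, Steve and Tim (both 41) are printed before Madonna (45).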
Word Frequency
This exercise uses data taken from the Wall Street Journal, a common
source of text in Natural Language Processing.
To test the programs, use a set of files stored in this
directory.
These files are the result of tokenising these files
and writing one token per line.
- Write a Python script count_tokens that
prints the frequency of all the tokens in a list of
files. Make sure that all the tokens are first
converted into lowercase:
% count_tokens 9405001.sent 9502005.sent
labeled: 1
up: 2
head: 24
pattern: 1
necessarily: 1
passive: 2
us: 3
observe: 1
presentation: 1
free: 2
...
Hints:
- Use the
string.lower() function from the string module (or the equivalent lower() string method) to turn a string into lowercase.
- Use a dictionary to count the word
frequencies. For example, the value
frequency['the'] stores the frequency of
the word 'the', and so on.
- Now, extend your program so that it prints out the
20 most frequent tokens, in descending order of
frequency:
% count_tokens 9405001.sent 9502005.sent
the: 561
of: 298
to: 169
in: 159
a: 152
is: 132
for: 125
and: 103
...
To do this, you may want to define a function
by_key that, given the arguments
a and b, returns the
numerical comparison between frequency[a]
and frequency[b]. Then you can use
by_key as the sorting criterion to sort
the dictionary keys.
- Extend the previous code so that the count as a proportion of total word counts
is printed out in parentheses as well:
% count_tokens 9405001.sent 9502005.sent
the: 561 (8.75%)
of: 298 (4.65%)
to: 169 (2.64%)
in: 159 (2.48%)
a: 152 (2.37%)
is: 132 (2.06%)
for: 125 (1.95%)
and: 103 (1.61%)
...
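The counting, sorting and percentage steps described above might be combined along the following lines. This is a rough sketch rather than a model solution: it works on a small in-memory token list instead of the .sent files, and the function names most_frequent and format_report are invented for illustration. In Python 2 the comparison function can be passed directly to sort(); functools.cmp_to_key (assumed available, i.e. Python 2.7 or later) is used here so that the sketch also runs under Python 3.

```python
from functools import cmp_to_key


def count_tokens(tokens):
    """Return a dictionary mapping each lowercased token to its frequency."""
    frequency = {}
    for token in tokens:
        token = token.lower()
        frequency[token] = frequency.get(token, 0) + 1
    return frequency


def most_frequent(frequency, n=20):
    """Return the n most frequent tokens, most frequent first."""
    def by_key(a, b):
        # Negative when a is more frequent than b, so a sorts first
        # (descending order of frequency).
        return frequency[b] - frequency[a]

    return sorted(frequency, key=cmp_to_key(by_key))[:n]


def format_report(frequency, n=20):
    """Format the top n tokens as 'token: count (percentage%)' lines."""
    total = sum(frequency.values())
    lines = []
    for token in most_frequent(frequency, n):
        share = 100.0 * frequency[token] / total
        lines.append("%s: %d (%.2f%%)" % (token, frequency[token], share))
    return lines


# Made-up sample data standing in for one token per line read from a file.
sample = ["The", "cat", "saw", "the", "dog", "and", "the", "cat"]
for line in format_report(count_tokens(sample), n=2):
    print(line)
```

For the sample list this prints "the: 3 (37.50%)" followed by "cat: 2 (25.00%)". Tokens with equal counts come out in no guaranteed order under Python 2, which is why only the top two are shown.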
Some notes:
- Working under Windows from the labs, you can map directly to the directory
that contains the files. You should already have a drive (G:) that links
to \\claudius\units; if not, use "My Computer" to map the drive. From there,
choose comp348\html\resources.
- With the Windows version, you'll probably want to write the results of your programs
to a file.
- Alternatively, again working under Windows, you can mimic the Unix-style command-line invocations above
by creating a small Python file that calls the Python programs and directs the
output to a file. For example, create a file containing just this:
import os
os.system("count_tokens.py 9405001.sent 9502005.sent > test1.out")
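If you would rather write the results to a file from inside your program, as the earlier note suggests, something like the following sketch may be enough. The helper name write_results is hypothetical, and the example writes into a temporary directory purely so that it can be run anywhere; in the labs you would pass whatever output filename you like.

```python
import os
import tempfile


def write_results(lines, filename):
    """Write each result line to filename, one per line (hypothetical helper)."""
    with open(filename, "w") as outfile:
        for line in lines:
            outfile.write(line + "\n")


# Two result lines in the format shown earlier, written to a temporary
# location (an assumption made only for the sake of the demonstration).
results = ["the: 561", "of: 298"]
out_path = os.path.join(tempfile.mkdtemp(), "test1.out")
write_results(results, out_path)

with open(out_path) as infile:
    print(infile.read())
```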
Mark Dras or