COMP249 Web Technology
Practical - Week 5
The aim of this practical is to introduce you to the Python language. When attacking these problems, make use of the resources available on the Web, such as Dive Into Python, or the recommended text. A short screencast on Python is also available via the lectures page. Work through the relevant chapters before starting on these problems in the practical session. It's important to get started with Python now, since you'll need to be comfortable with it for the second assignment.
Counting Words
Counting words in a document is made easy in Python by the use of dictionaries. A dictionary holds a piece of data for each key; in this case the keys can be words and the data can be the count of the number of occurrences.
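As a minimal illustration of the idea, the sketch below builds a small dictionary by hand. The words and counts are made up for the example and are not taken from any real file.

```python
# A dictionary mapping words (keys) to counts (values); these entries are made up
counts = {"cat": 3, "forge": 2}

counts["the"] = 12                  # add a new key with its value
counts["cat"] = counts["cat"] + 1   # update the value stored for an existing key

print(counts["cat"])                # look up the value stored for a key
print(len(counts))                  # number of keys in the dictionary
```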
Your program should read a file input.txt, count the words in the file and write out a report, one word per line with its count. For example:
    The: 10
    cat: 3
    the: 12
    forge: 2
    etc...
To complete this task you'll need to find out about splitting strings, file handling and dictionaries. The key to solving the problem is to read the file (one line at a time with readline, or all at once with read), split the text into words, then for each word either add it to the dictionary with a count of 1 or increment the value already stored. You can use the dictionary has_key() method to check whether the word is already present, e.g.:
    x = "Hello"
    if words.has_key(x):
        words[x] = words[x]+1
    else:
        words[x] = 1
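The steps described above could be put together roughly as follows. This is only one possible sketch, not a model answer: the function names count_words, format_report and main are invented for illustration, and the membership test uses the in operator, which works in both Python 2 and Python 3 (has_key() exists only in Python 2).

```python
def count_words(lines):
    """Return a dictionary mapping each word to its number of occurrences."""
    words = {}
    for line in lines:
        # split() with no arguments splits on any run of whitespace
        for word in line.split():
            if word in words:
                words[word] = words[word] + 1
            else:
                words[word] = 1
    return words


def format_report(words):
    """Return one 'word: count' string per dictionary entry."""
    return ["%s: %d" % (word, count) for word, count in words.items()]


def main(path="input.txt"):
    # Iterating over an open file yields one line at a time
    with open(path) as infile:
        words = count_words(infile)
    for line in format_report(words):
        print(line)
```

Calling main() from a script, with input.txt in the current directory, would print the report. Words are counted exactly as they appear, so "The" and "the" are separate entries and punctuation stays attached to words.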
Extensions
Here are a number of extensions for the above problem to exercise some more Python machinery. Some of these would be good starting points for items in your Portfolio.
- Count words case-insensitively: convert words to lower case before incrementing their dictionary count.
- Output the report in alphabetical order or numerical order.
- Read the name of the input file from the command line. The command line can be accessed via the argv list in the sys module, e.g.:
    import sys
    filename = sys.argv[1]
- Output the report as an HTML page with the words and numbers in a table. Write the HTML to a file (find out how to write to files) so that you can view it in your web browser.
- Remove punctuation from the input before splitting and counting the words. Look at string.replace or (more advanced) the regular expression module.
- Convert part of your script into a procedure which takes a string and returns a dictionary containing the word count for the string. Rewrite your main script to use this procedure.
- Save your procedure to a separate file and import it into your main script.
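Several of the extensions above can be combined in one script. The following is a hedged sketch of one way to do it, assuming that punctuation means anything other than word characters, whitespace and apostrophes. All function names and the output file name report.html are hypothetical choices made for this example.

```python
import re
import sys


def count_words(text):
    """Return a dictionary of lower-cased words and their counts in text."""
    # Replace every character that is not a word character (letter, digit,
    # underscore), whitespace or an apostrophe with a space
    cleaned = re.sub(r"[^\w\s']", " ", text.lower())
    words = {}
    for word in cleaned.split():
        # get() returns 0 when the word has not been seen yet
        words[word] = words.get(word, 0) + 1
    return words


def sorted_by_word(words):
    """Return (word, count) pairs in alphabetical order."""
    return sorted(words.items())


def sorted_by_count(words):
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    return sorted(words.items(), key=lambda pair: (-pair[1], pair[0]))


def html_report(pairs):
    """Return an HTML page containing a table of words and counts."""
    rows = ["<tr><td>%s</td><td>%d</td></tr>" % (word, count)
            for word, count in pairs]
    return ("<html><body><table>\n"
            "<tr><th>Word</th><th>Count</th></tr>\n"
            + "\n".join(rows)
            + "\n</table></body></html>\n")


def main(argv):
    filename = argv[1]  # first command-line argument after the script name
    with open(filename) as infile:
        words = count_words(infile.read())
    # report.html is a hypothetical output name chosen for this sketch
    with open("report.html", "w") as outfile:
        outfile.write(html_report(sorted_by_word(words)))
```

A script using this would end with a call such as main(sys.argv). If the functions were saved in their own file, say wordcount.py (again a made-up name), the main script could load them with import wordcount and call wordcount.count_words(text).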
Make sure you are comfortable running Python scripts from the graphical interface (e.g. IDLE) and from the command line on both Windows and Unix. Note that on Unix you will want to use the magic #! line as the first line of the script:
    #!/usr/bin/python
    print "Hello World!"
This, along with making the file executable (chmod +x file.py), allows you to run a Python script like a regular program (i.e. without typing python first). For example, if the above file is hello.py:
    titanic:~:11 % chmod +x hello.py
    titanic:~:12 % ./hello.py
    Hello World!
Alternatively, use the WinSCP interface to file permissions as outlined in the uploading screencast. Note that the above #! path (/usr/bin/python) will only work on Platypus; titanic has Python installed in a different location (/share/bin/python), so you'll need to use python script.py to test things on titanic.