COMP348 Document Processing and the Semantic Web
Practical, Week 6
Using SVM-Light
As discussed in last week's lectures, SVM-light is the SVM you'll be using in Assignment 1, Part 3.
The goal of this week's practical is to get familiar with it.
For the first part, you're just going to reproduce what I did in lectures:
applying the SVM to the SVM-light site's sample problem, which involves classifying Reuters articles as
either "corporate acquisitions" or not.
Downloading SVM-Light
To use SVM-light, you'll first have to get a copy. It consists of two stand-alone executables,
svm_learn and svm_classify. To use it under the lab Windows environment,
from within Python, the easiest thing to do is to download the Windows executables in a zipfile.
Then do the following:
-
Unzip the files somewhere convenient.
-
Get the example1 data from the SVM-light site, which
was used as the example in lectures. In case you can't untar it yourself in your Windows environment, here's a local zipfile
instead. Unzip it.
-
Read the Getting started: some Example Problems (Inductive SVM) section
(down the page) to see what it's supposed to do. The commands for learning and for classifying are:
svm_learn example1/train.dat example1/model
svm_classify example1/test.dat example1/model example1/predictions
-
You're going to want to run these from within Python. Here are two snippets of model code. Both use the subprocess module
(which you may not have come across: it was added in Python 2.4). The subprocess module lets you start other programs and capture their output, and
supersedes older functions such as os.system and os.popen. You can find more details on how to use it in the Python module documentation.
Here's the first version, run1.py. It writes to a file (so you can look at the contents of myoutfile
on the disk) and then reads the data back in to print it out.
import subprocess

# Send svm_learn's output to a file on disk.
myout = open("myoutfile", "w")
myprocess = subprocess.Popen(["svm_learn", "example1/train.dat", "example1/model"], stdout=myout)
# Wait for the process to finish before accessing the file below.
myprocess.wait()
myout.close()

# Read the file back in and print it line by line.
myout2 = open("myoutfile")
myoutline = myout2.readline()
while len(myoutline) > 0:
    print(myoutline.rstrip("\n"))
    myoutline = myout2.readline()
myout2.close()
Here's the second and more compact version, run2.py, if you don't want to write to a file.
In this code, passing stdout=subprocess.PIPE tells Python to make the stdout attribute a file object that provides the output of the child process.
Because it is a file object, you can read from it in the same way as if you had used open on an actual file.
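import subprocess

# universal_newlines=True makes the pipe return text lines rather than raw bytes.
myprocess = subprocess.Popen(["svm_learn", "example1/train.dat", "example1/model"], stdout=subprocess.PIPE, universal_newlines=True)
myout = myprocess.stdout
myoutline = myout.readline()
while len(myoutline) > 0:
    myoutline = myoutline.strip()
    print(myoutline)
    myoutline = myout.readline()
myout.close()
# Wait only after the pipe has been read, so a full pipe buffer cannot stall the child.
myprocess.wait()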
-
Modify these code snippets to classify the data as well, by calling svm_classify from within Python and capturing the output
(either in a file, or through the stdout file object). Your program should output the accuracy rate for the classification.
(It's 97.67% for the test data.)
-
Make sure you actually understand what you've just been doing.
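As a rough illustration of the classification step above (one possible approach, not the model solution), the output of svm_classify can be captured through a pipe and searched for the accuracy figure. This sketch assumes that svm_classify is on your path and that it reports its result on a line beginning "Accuracy on test set:"; check this against what you actually see. The helper names run_classifier and parse_accuracy are made up for this example.

```python
import re
import subprocess


def parse_accuracy(output):
    """Return the accuracy percentage found in svm_classify's output, or None.

    Assumes a line of the form 'Accuracy on test set: 97.67% (...)'.
    """
    match = re.search(r"Accuracy on test set:\s*([0-9.]+)%", output)
    if match is None:
        return None
    return float(match.group(1))


def run_classifier(test_file, model_file, predictions_file):
    """Run svm_classify and return everything it wrote to stdout as one string."""
    myprocess = subprocess.Popen(
        ["svm_classify", test_file, model_file, predictions_file],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text rather than bytes
    )
    # communicate() reads all of the output and then waits for the process to exit.
    output, _ = myprocess.communicate()
    return output
```

With the example1 files in place, calling parse_accuracy(run_classifier("example1/test.dat", "example1/model", "example1/predictions")) should give the accuracy as a number, provided the output format matches the assumption above.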
Transforming Data into SVM-Light Format
The aim of this exercise is to get familiar with producing the features and input format
that SVM-light requires. It's a continuation of last week's exercise on distinguishing
Dutch from English.
-
For this exercise, we're going to take as features the counts of the 10 most common letter triples from each
language. We'll give the Dutch ones feature numbers 1 to 10, and the English ones 11 to 20.
If you did last week's exercise exactly the way I did (i.e. used the same definitions: looking only at triples
of letters within words, splitting on whitespace, etc.), you will have ended up with the same counts
as those shown in part 1 of the question, stored in model.dat. With those values, the 20 features
are:
1 : een
2 : aar
3 : den
4 : gen
5 : der
6 : het
7 : oor
8 : ver
9 : nde
10 : van
11 : the
12 : and
13 : ing
14 : her
15 : for
16 : ent
17 : ith
18 : wit
19 : ere
20 : was
Write a function that produces this list.
-
For each sentence in the training corpus train.txt, produce a string in the SVM-light format. It should start with +1 if
the sentence is English, -1 if Dutch; it should then list (for non-zero counts) the feature number, followed
by a colon, followed by the feature count. For the first three sentences in the corpus:
D1: Jo werd het eerst wakker op den grauwen, schemerachtigen
D2: Kerstmorgen. Er hingen geen kousen bij den haard, en gedurende een paar
E3: Jo was the first to wake in the gray dawn of Christmas morning.
you should have the following strings:
-1 3:1 4:1 6:1
-1 1:2 2:2 3:1 4:2 9:1 13:1
+1 11:2 13:1 20:1
For the first sentence, the matches of letter triples that give the counts are marked in square brackets:
D1: Jo werd [het] eerst wakker op [den] grauwen, schemerachti[gen]
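The two steps of this exercise could be sketched along the following lines. This is an illustration rather than the reference solution. The function names (word_triples, top_triples, svm_light_line) are invented here, and FEATURES simply copies the 20 triples listed above instead of computing them from last week's corpus. The code assumes that triples are taken within whitespace-separated tokens, that triples containing punctuation are skipped, and that case is left unchanged. These choices reproduce the three example strings, but they may differ in detail from the definitions used last week.

```python
from collections import Counter

# Feature numbers copied from the list of 20 triples above (Dutch 1-10, English 11-20).
FEATURES = ["een", "aar", "den", "gen", "der", "het", "oor", "ver", "nde", "van",
            "the", "and", "ing", "her", "for", "ent", "ith", "wit", "ere", "was"]
FEATURE_IDS = {triple: number for number, triple in enumerate(FEATURES, start=1)}


def word_triples(text):
    """Yield each all-letter three-character window inside whitespace-separated tokens."""
    for word in text.split():
        for i in range(len(word) - 2):
            triple = word[i:i + 3]
            if triple.isalpha():  # assumption: skip windows containing punctuation
                yield triple


def top_triples(text, n=10):
    """Return the n most frequent triples in text (ties keep Counter's ordering)."""
    return [triple for triple, _ in Counter(word_triples(text)).most_common(n)]


def svm_light_line(sentence, is_english, feature_ids=FEATURE_IDS):
    """Build one SVM-light line: the label, then number:count for non-zero features."""
    counts = Counter(t for t in word_triples(sentence) if t in feature_ids)
    label = "+1" if is_english else "-1"
    # SVM-light expects the feature numbers in increasing order.
    pairs = sorted((feature_ids[t], c) for t, c in counts.items())
    return " ".join([label] + ["%d:%d" % pair for pair in pairs])


# prints: -1 3:1 4:1 6:1
print(svm_light_line("Jo werd het eerst wakker op den grauwen, schemerachtigen", False))
```

Applying top_triples to the Dutch text and to the English text separately, then joining the two lists, would give a feature list of the kind shown above, although the exact triples depend on the corpus and on how ties are broken.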
Mark Dras or