Department of Computing

COMP348 Document Processing and the Semantic Web

Practical, Week 6

Using SVM-Light

As discussed in last week's lectures, SVM-light is the SVM implementation you'll be using in Assignment 1, Part 3. The goal of this week's practical is to get familiar with it. For the first part, you're just going to reproduce what I did in lectures: applying the SVM to the sample problem from the SVM-light site, which classifies Reuters articles as either "corporate acquisitions" or not.

Downloading SVM-Light

To use SVM-light, you'll first have to get a copy. It consists of two stand-alone executables, svm_learn and svm_classify. To use it under the lab Windows environment, from within Python, the easiest thing to do is to download the Windows executables in a zipfile. Then do the following:

  1. Unzip the files somewhere convenient.

  2. Get the example1 data from the SVM-light site, which was used as the example in lectures. In case you can't untar it yourself in your Windows environment, here's a local zipfile instead. Unzip it.

  3. Read the Getting started: some Example Problems (Inductive SVM) section (down the page) to see what it's supposed to do. The commands for learning and for classifying are:

    svm_learn example1/train.dat example1/model
    svm_classify example1/test.dat example1/model example1/predictions
    
  4. You're going to want to run these from within Python. Here are two snippets of model code. Both use the subprocess module (which you may not have come across: it was added in Python 2.4). The subprocess module lets you run other programs as child processes, and supersedes older functions such as os.system and os.popen. You can find more details on how to use it in the Python module documentation.

    Here's the first version, run1.py. It writes to a file (so you can look at the contents of myoutfile on the disk) and then reads the data back in to print it out.

    import subprocess
    
    myout = open("myoutfile", "w")
    myprocess = subprocess.Popen(["svm_learn", "example1/train.dat", "example1/model"], stdout=myout)
    myprocess.wait()
    # necessary to make sure process has finished before accessing file below
    myout.close()
    
    myout2 = open("myoutfile")
    myoutline = myout2.readline()
    
    while len(myoutline) > 0:
        print myoutline,
    
        myoutline = myout2.readline()
    
    myout2.close()
    

    Here's the second and more compact version, run2.py, if you don't want to write to a file. In this code subprocess.PIPE tells Python that the stdout attribute is a file object that provides output from the child process. As a file object, you can access it the same way as if you used open on an actual file.

    import subprocess
    
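    myprocess = subprocess.Popen(["svm_learn", "example1/train.dat", "example1/model"], stdout=subprocess.PIPE)
    
    myout = myprocess.stdout
    myoutline = myout.readline()
    
    # read the output before waiting: if the child fills the pipe while
    # nobody is reading it, wait() can block forever
    while len(myoutline) > 0:
        myoutline = myoutline.strip()
        print myoutline
    
        myoutline = myout.readline()
    
    myout.close()
    # the output is exhausted, so now it is safe to wait for the process to exit
    myprocess.wait()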
    
  5. Modify these code snippets to classify the data as well, by calling svm_classify from within Python and capturing the output (either in a file, or through the stdout file object). Your program should output the accuracy rate for the classification. (It's 97.67% for the test data.)

  6. Make sure you actually understand what you've just been doing.
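
As a starting point for step 5, here is a minimal sketch of one way to capture a command's output and to work out the accuracy yourself. It assumes the predictions file holds one decision value per line, with the sign giving the predicted class, and that each test line starts with its +1 or -1 label, as described on the SVM-light site. The function names run_and_capture and accuracy_percent are made up for this illustration, and the code is written so that it should run under either Python 2.6+ or Python 3.

```python
import subprocess


def run_and_capture(command):
    """Run a command given as a list of strings; return its stdout as text."""
    process = subprocess.Popen(command, stdout=subprocess.PIPE)
    # communicate() reads everything the child writes and then waits for it,
    # so the pipe cannot fill up while nobody is reading.
    output, _ = process.communicate()
    return output.decode("utf-8", "replace")


def accuracy_percent(test_lines, prediction_lines):
    """Percentage of examples whose predicted sign matches the true label."""
    labels = []
    for line in test_lines:
        line = line.strip()
        if line and not line.startswith("#"):  # skip blank and comment lines
            labels.append(float(line.split()[0]))
    predictions = [float(line) for line in prediction_lines if line.strip()]
    if not labels or len(labels) != len(predictions):
        raise ValueError("need one prediction per labelled test example")
    correct = sum(1 for label, value in zip(labels, predictions)
                  if (label > 0) == (value > 0))
    return 100.0 * correct / len(labels)


# Hypothetical usage, assuming svm_classify can be found on the PATH and the
# example1 files are in the current directory:
#
#   print(run_and_capture(["svm_classify", "example1/test.dat",
#                          "example1/model", "example1/predictions"]))
#   with open("example1/test.dat") as test_file:
#       with open("example1/predictions") as prediction_file:
#           print("%.2f%%" % accuracy_percent(test_file, prediction_file))
```

Computing the accuracy from the predictions file, rather than picking it out of the text that svm_classify prints, means the result does not depend on the exact wording of that output.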

Transforming Data into SVM-Light Format

The aim of this exercise is to get familiar with producing the features and input format that SVM-light requires. It's a continuation of last week's exercise on distinguishing Dutch from English.

  1. For this exercise, we're going to take as features the counts of the 10 most common letter triples from each language. We'll give the Dutch ones feature numbers 1 to 10, and the English ones 11 to 20.

    If you did last week's exercise exactly the same way I did (i.e. used the same definitions: looking only at triples of letters within words by splitting on whitespace, etc.), you would have ended up with the same counts shown in part 1 of that question, stored in model.dat. If you got these same values, the 20 features would be:

    1 : een
    2 : aar
    3 : den
    4 : gen
    5 : der
    6 : het
    7 : oor
    8 : ver
    9 : nde
    10 : van
    11 : the
    12 : and
    13 : ing
    14 : her
    15 : for
    16 : ent
    17 : ith
    18 : wit
    19 : ere
    20 : was
    

    Write a function that produces this list.

  2. For each sentence in the training corpus train.txt, produce a string in the SVM-light format. It should start with +1 if the sentence is English, -1 if Dutch; it should then list (for non-zero counts) the feature number, followed by a colon, followed by the feature count. For the first three sentences in the corpus:

    D1: Jo werd het eerst wakker op den grauwen, schemerachtigen
    D2: Kerstmorgen. Er hingen geen kousen bij den haard, en gedurende een paar
    E3: Jo was the first to wake in the gray dawn of Christmas morning.
    

    you should have the following strings:

    -1 3:1 4:1 6:1 
    -1 1:2 2:2 3:1 4:2 9:1 13:1 
    +1 11:2 13:1 20:1 
    

    For the first sentence, the letter triples that give the counts are het (feature 6), den (feature 3), and the gen at the end of schemerachtigen (feature 4):

    D1: Jo werd het eerst wakker op den grauwen, schemerachtigen
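
The two steps above can be sketched roughly as follows. This is only an illustration under some assumptions: words are obtained by splitting on whitespace (so punctuation stays attached to its word), text is lowercased before counting, and the feature list is simply copied from the table above instead of being read from model.dat, whose format is not shown here. The names word_triples, top_triples, triple_counts and svmlight_line are made up for this sketch.

```python
from collections import Counter

# Feature numbers 1-10 are the Dutch triples, 11-20 the English ones,
# copied from the table in step 1.
FEATURES = ["een", "aar", "den", "gen", "der", "het", "oor", "ver", "nde", "van",
            "the", "and", "ing", "her", "for", "ent", "ith", "wit", "ere", "was"]
FEATURE_NUMBER = dict((triple, i + 1) for i, triple in enumerate(FEATURES))


def word_triples(text):
    """Yield each run of three characters inside a whitespace-separated word."""
    for word in text.lower().split():  # lowercasing is an assumption
        for i in range(len(word) - 2):
            yield word[i:i + 3]


def top_triples(text, n=10):
    """Return the n most frequent triples in text, most frequent first."""
    return [triple for triple, _ in Counter(word_triples(text)).most_common(n)]


def triple_counts(sentence):
    """Map feature number -> count for the feature triples found in sentence."""
    counts = {}
    for triple in word_triples(sentence):
        number = FEATURE_NUMBER.get(triple)
        if number is not None:
            counts[number] = counts.get(number, 0) + 1
    return counts


def svmlight_line(sentence, is_english):
    """Format one sentence as a label followed by number:count pairs."""
    label = "+1" if is_english else "-1"
    counts = triple_counts(sentence)
    # SVM-light expects feature numbers in increasing order; zero counts are omitted.
    pairs = ["%d:%d" % (number, counts[number]) for number in sorted(counts)]
    return " ".join([label] + pairs)
```

Called on the three sample sentences (without the D1:, D2:, E3: prefixes), svmlight_line gives the strings listed above, minus the trailing space. A list like the one in step 1 could then be built by calling top_triples once on the Dutch text and once on the English text and joining the two results.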
    

 


Comments to: Mark Dras or Diego Molla

Copyright Macquarie University
CRICOS provider no. 00002J