
Department of Computing


COMP348 Document Processing and the Semantic Web

Assignment 1, Part 2: FAQ

  • I am aware that the assignment is due on Sunday. Does that mean we need to submit the hardcopy on Sunday too, or could we do that on Monday morning?

    Monday by noon for the hardcopy is fine.

  • Also, with regard to our file submission, you said that you will be writing a script to run classify_individual.py. Does that mean you won't be using our copy of classify_corpus.py? And does that mean classify_individual.py can't rely on data imported through classify_corpus.py, since your auto-marker won't be running that?

    I'll be using your copy of classify_corpus.py on the data you have in order to see how it works, but I'll be using my own Python code to call classify_individual.py over some new data. So you'll need classify_individual.py to import data in some other way (e.g. stored in a separate file which classify_individual.py reads).

  • Moreover, you said that classify_individual.py should take one argument, the filename. I presume this is so that the automatic marker can use it. If so, what function name are we meant to use for that?

    I'll just be invoking your program with

    classify_individual.py blogpost_filename 
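
    Since the file is run as a program, the filename arrives as the single command-line argument rather than through a named function. A minimal sketch of such an entry point, assuming Python 3: classify_text is a hypothetical placeholder for your own classifier, and which category maps to +1 is an assumption here, so follow the assignment spec.

```python
import sys


def classify_text(text):
    """Hypothetical placeholder classifier: returns +1 or -1.

    Which category corresponds to +1 is an assumption; use the mapping
    given in the assignment spec.
    """
    return 1 if "lol" in text.lower() else -1


def classify_file(filename):
    """Read one blog post file and return its predicted label."""
    # errors="replace" stops odd bytes in the data from raising an exception.
    with open(filename, encoding="utf-8", errors="replace") as handle:
        return classify_text(handle.read())


def format_label(label):
    """Render the label as the string '+1' or '-1'."""
    return "+1" if label > 0 else "-1"


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # Exactly one argument: the blog post filename.
        print(format_label(classify_file(sys.argv[1])))
    else:
        sys.stderr.write("usage: classify_individual.py blogpost_filename\n")
```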
    
  • What output is our program meant to produce, if any? Is it meant to be a text file, or just printed? What exactly should be output?

    Only classify_individual.py has a specified output (+1 or -1). classify_corpus.py is just intended to determine the accuracy of your classifier on the test set. It doesn't have any specified output format, as its main purpose is to produce the accuracy figure you'll be using in your report; just printing the accuracy to stdout is fine.
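
    For classify_corpus.py, the accuracy is simply the fraction of test posts whose predicted label matches the true one. A minimal sketch, assuming Python 3: the function name and the label lists are made up for illustration, and how you obtain the true labels depends on how your data is organised.

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of posts whose predicted label matches the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for g, p in zip(gold_labels, predicted_labels) if g == p)
    return correct / len(gold_labels)


# Made-up labels purely for illustration (+1 / -1, as in classify_individual.py).
gold = [1, 1, -1, -1, 1]
predicted = [1, -1, -1, -1, 1]
print("Accuracy: %.2f" % accuracy(gold, predicted))  # prints "Accuracy: 0.80"
```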

  • Are we allowed to have other files with our submission, other than the specified .py files? i.e. text files containing data?

    Yes. I've amended the assignment specs to include this.

  • Should we assume that the data to be tested will all be xml files, which reside in the exact same folder that contains the py [and possible text] files?

    Yes, they'll all look like the files you've seen.

  • What does 'rule-based' mean? I think my task is to build a system that classifies the blog posts using only rules, not statistical methods such as k-nearest neighbours. Did I guess right?

    Yes, that's right. Rule-based doesn't include kNN, so what you're suggesting is correct. (However, see below.) Some examples of rules for a rule-based approach are in the spec:

    IF blog_post uses "lol" THEN category is Young
    
    IF number of occurrences of urlLink in blog_post > 5 THEN category is Old
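
    The two example rules could be coded roughly as follows. This is only a sketch, assuming Python 3: the function name, the order in which the rules are tried, and the whole-word match for "lol" are all assumptions, not part of the spec.

```python
import re


def classify_by_rules(post_text):
    """Apply the two example rules in order; return 'Young', 'Old', or None."""
    # Rule 1: IF blog_post uses "lol" THEN category is Young.
    # \b stops "lol" from matching inside words such as "lollipop".
    if re.search(r"\blol\b", post_text.lower()):
        return "Young"
    # Rule 2: IF number of occurrences of urlLink in blog_post > 5 THEN Old.
    if post_text.count("urlLink") > 5:
        return "Old"
    # No rule fired; a real system needs a default category here.
    return None
```

    A real rule set would have more rules and a default category for posts that match none of them.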
    
  • In Assignment 1, Part 2, I have a problem using xml.dom.minidom.parse() to parse the XML files, because a large number of them (approx. 70-75%) are not well formed. The most common problem is an unescaped "&", as in the following sample. Should I ignore all the XML files that are not well formed, or just use regular expressions to extract the content?

    sample.xml
    <Blog>
    <date>1,January,2008</date>
    <post>abc & 123</post>
    </Blog>
    

    It's up to you whether you want to use an XML parser; you don't have to if you don't want to, or you can choose to use it only on files with well-formed XML. It's also up to you whether you want to use all of the training data; you could choose to use only the files with well-formed XML. However, you do have to make sure you process all the test files.
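
    One way to combine the two options is to try the XML parser first and fall back to a regular expression when parsing fails. A minimal sketch, assuming Python 3 and the <post> element shown in the sample; the function name is made up, and the fallback does not decode entities such as &amp;.

```python
import re
import xml.dom.minidom
from xml.parsers.expat import ExpatError

POST_PATTERN = re.compile(r"<post>(.*?)</post>", re.DOTALL)


def extract_posts(xml_text):
    """Return the text of every <post> element in one blog file."""
    try:
        document = xml.dom.minidom.parseString(xml_text)
    except ExpatError:
        # Not well formed (e.g. a bare "&"): fall back to a regular expression.
        return [match.strip() for match in POST_PATTERN.findall(xml_text)]
    posts = []
    for node in document.getElementsByTagName("post"):
        # Join the text-node children of this <post> element.
        text = "".join(child.data for child in node.childNodes
                       if child.nodeType == child.TEXT_NODE)
        posts.append(text.strip())
    return posts


sample = "<Blog>\n<date>1,January,2008</date>\n<post>abc & 123</post>\n</Blog>"
print(extract_posts(sample))  # the bare "&" triggers the regex fallback
```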

  • I am using a k-nearest neighbours (kNN) method to classify each category, but my practical supervisor said kNN is not really rule-based. Is my method OK, or will I lose marks by using it?

    It's true that kNN isn't a rule-based method; it's a supervised learning method. The idea in Part 3 is to compare a simpler method to an SVM-based method, and a rule-based method (which would involve processing the files, identifying useful distinguishing features, etc.) was meant to lead you up to choosing features for the SVM approach of Part 3. However, I don't mind if you're using kNN: you can just compare that with the SVM-based method in Part 3.

  • I still can't figure out the expected file structure. Currently, I store test data in a test_data folder and training data in a training_data folder. Is this OK? If not, how should I arrange the Python code, test data and training data? (The specification says the files are expected to be in the current directory.)

    You should expect to find the data files in the current directory. In your case, with your supervised learning approach, you might have to have separate copies of your program in each directory. (I know that's ugly, and for Part 3 there will be function arguments for separate training and testing spaces, but I wasn't expecting anyone to do that for Part 2. I used the current directory requirement to be uniform for the script I'll use for marking.)

  • Using kNN, I need to pass the filename and a three-dimensional list (a list containing points, each labelled 'Young' or 'Old'). But according to the specification, classify_individual.py should take a single filename as an argument.

    Yes, classify_individual.py should only take a single filename as an argument. Any model that you build as part of a supervised learning approach should be stored in a separate file (not passed as an argument) and loaded from within classify_individual.py. (Again, this is just so that I have a consistent interface when I call your file.) Note that SVM-light stores its model in a separate file.

    To run your program over the test data, I'll be writing a script that runs your classify_individual.py over a new data set, and it will expect to find your code in the same directory as the data. If you have a model file constructed from a supervised learning approach (or any other file, e.g. a separate file containing regular expressions for rules), it should also be in the same directory, and be opened and read by your classify_individual.py. Make sure you mention this in your documentation so I know what's going on.
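
    Storing and reloading such a file could look roughly like this. It is only a sketch, assuming Python 3 and a model that can be written as JSON (for kNN, say, the list of labelled training points); the filename model.json and both function names are made up for illustration.

```python
import json
import os

MODEL_FILENAME = "model.json"  # hypothetical name; state yours in the documentation


def save_model(model, directory="."):
    """Write the model (any JSON-serialisable object) next to the code and data."""
    path = os.path.join(directory, MODEL_FILENAME)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle)


def load_model(directory="."):
    """Read the model back; classify_individual.py would call this at start-up."""
    path = os.path.join(directory, MODEL_FILENAME)
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

    The training step would call save_model once; classify_individual.py then only needs load_model and the single filename argument.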


Comments to: Mark Dras or Diego Molla

Copyright Macquarie University