-
I am aware that the assignment is due on Sunday. Does that mean we
need to submit the hardcopy on Sunday too, or could we do that on
Monday morning?
Monday by noon for the hardcopy is fine.
-
Also, with regard to our file submission, you said that you will be
writing a script to run over classify_individual.py. Does that mean
you won't be using our copy of classify_corpus.py? And does that
mean I can't import data through classify_corpus for use in
classify_individual, since your auto-marker won't be doing that?
I'll be using your copy of classify_corpus.py on the
data you have in order to see how it works, but I'll be using my own Python code to call
classify_individual.py over some new data. So you'll
need classify_individual.py to import data in some other
way (e.g. stored in a separate file which classify_individual.py reads).
-
Moreover, you said that the former file should take one argument,
which is to be the filename. I presume this is so that the automatic
marker can use it. But then if so, what function name are we meant to
be using for that?
I'll just be invoking your program with
classify_individual.py blogpost_filename
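As a rough sketch of what that invocation implies, the script can read the single filename from sys.argv and print its label. The classify function here is a hypothetical placeholder, the "+1"/"-1" strings follow the output specified for classify_individual.py, and the error handling and encoding choices are assumptions rather than part of the spec.

```python
import sys


def classify(text):
    """Hypothetical placeholder: a real classifier would inspect the text."""
    return 1


def label_for_file(path):
    """Read one blog post file and return its label as "+1" or "-1"."""
    # errors="replace" is an assumption, so odd bytes do not abort the run.
    with open(path, encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    return "+1" if classify(text) > 0 else "-1"


def main(argv):
    """Expect exactly one argument after the script name: the filename."""
    if len(argv) != 2:
        print("usage: classify_individual.py blogpost_filename", file=sys.stderr)
        return 2
    print(label_for_file(argv[1]))
    return 0


# Only run when a single filename was actually supplied on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

No particular function name matters to the caller in this arrangement; only the command-line behaviour does.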
-
What output was our program meant to produce, if any? Was it meant
to be a text file, or just printed? What exactly should be output?
Only classify_individual.py has a specified output (+1 or -1).
classify_corpus.py is just intended to determine the accuracy
of your classifier on the test set. It doesn't have any specified output format,
as its main purpose is to produce the accuracy figure you'll be using in your
report; just printing the accuracy to stdout is fine.
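For illustration, the accuracy figure can be computed as the fraction of test posts whose predicted label matches the true label. This is a minimal sketch: the function name, the list-based inputs, and the example labels are all made up for the example, not taken from the spec.

```python
def accuracy(predicted, gold):
    """Fraction of posts whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Made-up labels, purely to show the call; three of four predictions match.
predicted = [+1, -1, +1, +1]
gold = [+1, -1, -1, +1]
print(f"accuracy: {accuracy(predicted, gold):.3f}")
```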
-
Are we allowed to have other files with our submission, other than
the specified .py files? i.e. text files containing data?
Yes. I've amended the assignment specs to include this.
-
Should we assume that the data to be tested will all be xml files,
which reside in the exact same folder that contains the py [and
possible text] files?
Yes, they'll all look like the files you've seen.
-
What does 'rule-based' mean?
I think my task is to build a system that classifies the blog posts using only rules, not statistical methods such as K-nearest neighbour.
Did I guess right?
Yes, that's right. Rule-based doesn't include kNN, so what you're suggesting is correct. (However,
see below.) Some examples of rules for a rule-based approach are in the spec:
IF blog_post uses "lol" THEN category is Young
IF number of occurrences of urlLink in blog_post > 5 THEN category is Old
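The two example rules could be written in Python roughly as follows. This is only a sketch: the function name, the order in which the rules are tried, the case-insensitive whole-word match for "lol", and returning None when no rule fires are all assumptions, not requirements from the spec.

```python
import re


def classify_by_rules(text):
    """Return "Young", "Old", or None when neither example rule applies."""
    # Rule 1: the post uses "lol" as a whole word (case-insensitive here).
    if re.search(r"\blol\b", text, flags=re.IGNORECASE):
        return "Young"
    # Rule 2: the token urlLink occurs more than five times in the post.
    if text.count("urlLink") > 5:
        return "Old"
    # No rule fired; a real system would need a default category.
    return None
```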
-
In Assignment 1 Part 2, I have a problem using
xml.dom.minidom.parse() to parse the XML files because a large number
of files (approx. 70-75%) are not well formed. The most common mistake
is an unescaped "&", as in the following sample. Should I ignore all
XML files that are not well formed, or just use regular expressions to
extract the content?
sample.xml
<Blog>
<date>1,January,2008</date>
<post>abc & 123</post>
</Blog>
It's up to you whether you want to use an XML parser; you don't have to if you
don't want to, or you can choose to use it only on files with well-formed XML.
It's also up to you whether you want to use all of the training
data; you could choose to use only the files with well-formed XML.
However, you do have to make sure you process all the test files.
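One way to make sure every file gets processed is to try the XML parser first and fall back to a regular expression when parsing fails. The sketch below assumes the posts sit in <post> elements, as in the sample above; the function name and the regular expression are illustrative, and the fallback does not decode entities such as &amp;.

```python
import re
from xml.dom import Node, minidom
from xml.parsers.expat import ExpatError

# Assumed pattern: matches the contents of each <post> element.
POST_PATTERN = re.compile(r"<post>(.*?)</post>", re.DOTALL)


def extract_posts(xml_text):
    """Return the text of every <post>, even if the XML is not well formed."""
    try:
        document = minidom.parseString(xml_text)
    except ExpatError:
        # Not well formed (e.g. a bare "&"): fall back to pattern matching.
        return [match.strip() for match in POST_PATTERN.findall(xml_text)]
    posts = []
    for node in document.getElementsByTagName("post"):
        # Join the text children of the element into one string.
        text = "".join(
            child.data
            for child in node.childNodes
            if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
        )
        posts.append(text.strip())
    return posts
```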
-
I am using a K-Nearest Neighbours method to classify each category,
but my practical supervisor said kNN is not really rule-based.
I am wondering whether my method is OK or whether I will lose marks by using it.
It's true that kNN isn't a rule-based method; it's a supervised learning method.
The idea in Part 3 is to compare a simpler method to an SVM-based method, and a
rule-based method (which would involve processing the files, identifying useful
distinguishing features, etc) was meant to lead you up to choosing features for the SVM
approach of Part 3.
However, I don't mind if you're using kNN: you can just compare that with the SVM-based
method in Part 3.
-
I still cannot figure out the structure of the files. Currently, I store test data in a
test_data folder, and training data in a training_data folder. Is this OK? If not, how am I supposed
to arrange the Python code, test data and training data? (I ask because the specification says the
files are expected to be found in the current directory.)
You should expect to find the data files in the current directory. In your case, with your supervised
learning approach, you might have to have separate copies of your program in each directory. (I know that's ugly,
and for Part 3 there will be function arguments for separate training and testing spaces, but I wasn't expecting
anyone to do that for Part 2. I used the current directory requirement to be uniform
for the script I'll use for marking.)
-
With kNN, I need to pass both the filename and a three-dimensional list (a list containing points
and their 'Young' or 'Old' labels). But in the specification, classify_individual.py should take a
single filename as an argument.
Yes, classify_individual.py should only take a single filename as an argument.
Any model that you build as part of a supervised learning approach should be stored in a separate file,
not passed as an argument, and read by classify_individual.py.
(Again, this is just so that I have a consistent interface when I
call your file.) Note that SVM-light stores its model in a separate file.
To run your program over test data, I'll write a script that calls your classify_individual.py
on a new data set, and it will expect to find your code in the same directory as the data.
If you have a model file constructed from a supervised learning approach (or any other file -- e.g. a separate file
containing regular expressions for rules) it should also be in the same directory, and be opened and read
by your classify_individual.py.
Make sure you mention this in your documentation so I know what's going on.
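As an illustration of keeping the model out of the argument list, classify_individual.py could load it from a file in the current directory. Everything specific in this sketch is an assumption: the filename model.json, the JSON format, and the load_model helper are hypothetical, and any format your approach writes out would do.

```python
import json
import os

# Hypothetical filename; the model file sits alongside the code and data.
MODEL_FILENAME = "model.json"


def load_model(directory="."):
    """Load the stored model from the given directory (default: current)."""
    path = os.path.join(directory, MODEL_FILENAME)
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

With this arrangement the script still takes only the blog post filename as its argument, and the training step writes the model file ahead of time.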