COMP348 Document Processing and the Semantic Web
Tutorial Week 5
Text Classification
You have data which consists of a set of sentences divided into two classes, Young and Old. Each sentence is
annotated with the appropriate letter (Y/O) and the sentence number. Example sentences are as follows:
Y1: i hope this wasn't for real ... its pathetic lol ... i bet 12 percent of the world would wanna smack u cats after they heard this garbage!!
Y2: omg you are so funny!!! I love ur video!!! ur the best!!!
O3: I refer to your email dated Wednesday 27 February, subject heading "Water Conservation -- 2008 plan".
Y4: That was the Best Video Ever!! Ken Lee Tulibu dibu douchoo Ken Lee ROFLMAO
O5: Dear Sir, I am writing to you about a Summer Internship. I am a postgraduate student at the IIT Kanpur, enrolled in a Bachelor of Engineering.
(Sources of data: Y1, Y2, edited comments from YouTube Ashkon: "Hot Tubbin'" -- OFFICIAL CUT; Y4, edited comments from YouTube Ken Lee - Bulgarian Idol (WITH ENGLISH TRANSLATION).)
It has been suggested that there are two useful features for classifying the sentences into Young or Old:
count of capitalised words (Caps); and count of multiple punctuation marks and "blog words" such as lol, u/ur or ROFLMAO
(BlogWords). We will define Caps to include acronyms all in capitals. We will define BlogWords to include
multiple punctuation marks of the same type not separated by whitespace (e.g. !!!). Sentence Y1 then has
feature counts (Caps = 0, BlogWords = 5).
Use compile from Python's re module to define regular expressions that would
recognise these features. You can assume that the only BlogWords of interest are the ones in the example
sentences. Also give Python code to calculate the counts.
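One possible starting point for the two regular expressions is sketched below. It is only one reading of the definitions: the names CAPS, BLOG_WORDS and feature_counts are invented for this sketch, the blog-word list is limited to the words named in the text (so "omg" is not counted), and any punctuation mark repeated without whitespace is treated as a BlogWord.

```python
import re

# Capitalised words, including all-capital acronyms such as ROFLMAO.
CAPS = re.compile(r"\b[A-Z][A-Za-z]*\b")

# ASSUMPTION: blog words are only those named in the tutorial text, plus any
# punctuation mark repeated two or more times with no whitespace in between
# (e.g. "!!!" or "...").
BLOG_WORDS = re.compile(r"\b(?:lol|ur|u|ROFLMAO)\b|([^\w\s])\1+")


def feature_counts(sentence):
    """Return the pair (Caps, BlogWords) for one sentence."""
    caps = len(CAPS.findall(sentence))
    # finditer counts whole matches; findall would return only the group.
    blog_words = sum(1 for _ in BLOG_WORDS.finditer(sentence))
    return caps, blog_words


y1 = ("i hope this wasn't for real ... its pathetic lol ... i bet 12 percent "
      "of the world would wanna smack u cats after they heard this garbage!!")
print(feature_counts(y1))  # (0, 5), the counts given above for Y1
```

Under this reading a sentence-initial word counts as capitalised, and ROFLMAO contributes to both features. Either choice could reasonably be made differently.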
-
Assume a new unclassified sentence:
So, how about lunch? How are you placed for this Thursday? Let me know ... if you're not avoiding me, that is!!
Given as training data the example sentences above, use the k-Nearest Neighbours approach to classify it,
(1) for k = 1, and (2) for k = 3.
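The k-Nearest Neighbours step can be sketched as follows, using squared Euclidean distance and a majority vote. Only Y1's vector comes from the text; the other vectors, and the names TRAINING, NEW_SENTENCE and knn_classify, are assumptions that follow from one reading of the feature definitions, so recompute them if your definitions differ.

```python
from collections import Counter

# (Caps, BlogWords) counts with class labels. Y1 is given in the tutorial;
# the rest are ASSUMED values (sentence-initial words count as capitalised,
# "--" counts as repeated punctuation, "omg" is not a blog word).
TRAINING = [
    ((0, 5), "Y"),   # Y1
    ((1, 5), "Y"),   # Y2
    ((5, 1), "O"),   # O3
    ((10, 2), "Y"),  # Y4
    ((10, 0), "O"),  # O5
]
NEW_SENTENCE = (4, 2)  # ASSUMED counts for the unclassified sentence


def squared_distance(a, b):
    """Squared Euclidean distance; it ranks neighbours the same as the root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(point, training, k):
    """Return the majority label among the k training points nearest to point."""
    nearest = sorted(training, key=lambda item: squared_distance(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


for k in (1, 3):
    print(k, knn_classify(NEW_SENTENCE, TRAINING, k))
```

With an odd k and two classes the vote cannot tie, although two training points at equal distance would still need a tie-breaking rule.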
-
You can translate the integer-valued feature counts to binary ones by having the binary features indicate
the presence of capitalised words or of blog words. The binary feature vector for Y1 would then be (0, 1).
Given as training data the example sentences above, and given these binary feature vectors,
use Naive Bayes to classify the new sentence above.
-
The binarisation above can be viewed as imposing a threshold of 1: if the feature count is below 1, then
the binary feature has value 0; otherwise it has value 1. Assume we use instead a threshold of 2: if a feature
count is below 2, the binary feature has value 0; otherwise it has value 1.
Recalculate the Naive Bayes classification given this new binary representation.
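A minimal sketch of the Naive Bayes calculation for either threshold is given below. The count vectors other than Y1's are assumed values under one reading of the feature definitions, and the function names and the optional smoothing parameter alpha are inventions of this sketch, not part of the exercise.

```python
# (Caps, BlogWords) counts; only Y1's vector is given in the tutorial,
# the others are ASSUMED and should be recomputed for your own definitions.
TRAINING = [
    ((0, 5), "Y"),   # Y1
    ((1, 5), "Y"),   # Y2
    ((5, 1), "O"),   # O3
    ((10, 2), "Y"),  # Y4
    ((10, 0), "O"),  # O5
]
NEW_SENTENCE = (4, 2)  # ASSUMED counts for the unclassified sentence


def binarise(counts, threshold):
    """Map each count to 1 if it reaches the threshold, otherwise 0."""
    return tuple(int(c >= threshold) for c in counts)


def naive_bayes_scores(training, new_counts, threshold, alpha=0.0):
    """Return {label: P(label) * product over i of P(feature_i = x_i | label)}.

    alpha > 0 applies add-alpha smoothing (hypothetical extra, off by default).
    """
    x = binarise(new_counts, threshold)
    scores = {}
    for label in sorted({lab for _, lab in training}):
        rows = [binarise(c, threshold) for c, lab in training if lab == label]
        score = len(rows) / len(training)  # class prior
        for i, value in enumerate(x):
            matches = sum(1 for row in rows if row[i] == value)
            score *= (matches + alpha) / (len(rows) + 2 * alpha)
        scores[label] = score
    return scores


for threshold in (1, 2):
    scores = naive_bayes_scores(TRAINING, NEW_SENTENCE, threshold)
    print(threshold, scores, max(scores, key=scores.get))
```

Without smoothing, a feature value never seen with a class drives that class's score to zero, which is worth watching for when the threshold changes.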
Text Classification Accuracy
You have built a system for classifying documents into one of two classes, C1 or C2.
The system works by identifying and matching characteristics normally found in C1 documents,
and then classifying these as C1 and everything not matching as C2. It is quite accurate when
it does identify C1 documents; however, its matching rules do not find many of them. Its precision
rate for C1 documents is 90%; but in 1000 documents it only correctly classifies 90 as C1,
although it is known that the 1000 documents are split 60:40 between C1 and C2.
-
What is the precision of class C2? What is the overall accuracy of the system?
-
Is this a particularly good overall accuracy rate?
-
It has been suggested that, instead of having the system classify everything not identified as C1 (the "rejects") as C2, it should assign some proportion p of the rejects to class C1. Under what conditions might this be a good suggestion?
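The bookkeeping behind these questions can be sketched as below. All variable names are invented here, and expected_accuracy assumes the relabelled rejects are picked uniformly at random, which the question itself leaves open.

```python
total = 1000
actual_c1, actual_c2 = 600, 400   # the 60:40 split
true_c1 = 90                      # documents correctly classified as C1
precision_c1 = 0.9

predicted_c1 = round(true_c1 / precision_c1)  # documents labelled C1
false_c1 = predicted_c1 - true_c1             # C2 documents labelled C1
predicted_c2 = total - predicted_c1           # the "rejects"
true_c2 = actual_c2 - false_c1                # rejects that really are C2
missed_c1 = actual_c1 - true_c1               # rejects that really are C1

precision_c2 = true_c2 / predicted_c2
accuracy = (true_c1 + true_c2) / total
majority_baseline = max(actual_c1, actual_c2) / total


def expected_accuracy(p):
    """Expected accuracy if a random proportion p of rejects is relabelled C1.

    ASSUMPTION: rejects are chosen uniformly at random, so the relabelled
    set has the same C1/C2 mix as the rejects overall.
    """
    correct = true_c1 + p * missed_c1 + (1 - p) * true_c2
    return correct / total


print(precision_c2, accuracy, majority_baseline)
print([round(expected_accuracy(p), 3) for p in (0.0, 0.5, 1.0)])
```

Comparing accuracy with majority_baseline is one way to judge whether an accuracy figure is good for a given class split.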
Information Gain and Mutual Information
You have documents constituting your training data divided into two classes, C1 (say, "medicine") and C2 (say, "sport"). There are 64 documents in C1 and 192 in C2. Of these documents, the numbers that contain particular words are given in the table below.
| | C1 | C2 |
| doctor | 16 | 3 |
| nurse | 12 | 1 |
| golf | 8 | 24 |
| ball | 2 | 96 |
-
Calculate the information gain for the term "doctor", i.e. G(doctor). You can just give the answer as an expression in logarithms.
-
Calculate G(golf). Which of the two terms would you select for a classification task?
-
Calculate the mutual information between the term "doctor" and class C1, i.e. I(doctor;C1), and between the term "golf" and class C1, i.e. I(golf;C1). Which term is preferable for classification?
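For checking a hand calculation numerically, the sketch below assumes the usual definitions: G(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), and the pointwise form I(t;c) = log2( P(t,c) / (P(t) P(c)) ). If the lectures define either quantity differently, adjust accordingly. The function names are invented for this sketch.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits, skipping zero probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)


def information_gain(t_c1, t_c2, n_c1, n_c2):
    """G(t) from document counts: t_c1/t_c2 contain the term, n_c1/n_c2 are class sizes."""
    n = n_c1 + n_c2
    with_t = t_c1 + t_c2
    without_t = n - with_t
    h_class = entropy([n_c1 / n, n_c2 / n])
    h_with = entropy([t_c1 / with_t, t_c2 / with_t])
    h_without = entropy([(n_c1 - t_c1) / without_t, (n_c2 - t_c2) / without_t])
    return h_class - (with_t / n) * h_with - (without_t / n) * h_without


def mutual_information(t_c, t_other, n_c, n_other):
    """Pointwise I(t;c) = log2( P(t, c) / (P(t) * P(c)) )."""
    n = n_c + n_other
    return log2((t_c / n) / (((t_c + t_other) / n) * (n_c / n)))


for term, (in_c1, in_c2) in {"doctor": (16, 3), "golf": (8, 24)}.items():
    print(term, information_gain(in_c1, in_c2, 64, 192),
          mutual_information(in_c1, in_c2, 64, 192))
```

A term that occurs in the same proportion of documents in both classes is independent of the class, so both measures come out at zero for it.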
Still More Regular Expressions
-
Using the re package in Python, write regular expression
functions that will return from a given string str:
-
all of the capitalised words
-
all of the words of length 3
-
all of the duplicate words (e.g. "the the" in "I went to the the park")
-
all of the acronyms
-
'French spacing' is the typographical practice of adding two spaces after
a full-stop.
Give a regular expression function that will replace French spacing in a string
by ordinary single spacing.
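The following is one possible set of such functions, offered as a sketch. The function names are made up here, the parameter is called text rather than str to avoid shadowing the Python built-in, a "word" is taken to be a run of ASCII letters, and an acronym is assumed to be two or more consecutive capitals (so dotted forms such as U.S.A. are not handled).

```python
import re


def capitalised_words(text):
    """Words that start with a capital letter."""
    return re.findall(r"\b[A-Z][A-Za-z]*\b", text)


def words_of_length_3(text):
    """Words made of exactly three letters."""
    return re.findall(r"\b[A-Za-z]{3}\b", text)


def duplicate_words(text):
    """Immediately repeated words, returned as the full repeated span."""
    # group(0) is the whole match; findall would return only the first word.
    return [m.group(0) for m in re.finditer(r"\b(\w+)\s+\1\b", text)]


def acronyms(text):
    """ASSUMPTION: an acronym is two or more consecutive capital letters."""
    return re.findall(r"\b[A-Z]{2,}\b", text)


def fix_french_spacing(text):
    """Replace a full stop followed by exactly two spaces with a single space."""
    return re.sub(r"\. {2}", ". ", text)


print(duplicate_words("I went to the the park"))
print(fix_french_spacing("One.  Two.  Three."))
```

Contractions such as "don't" are split at the apostrophe by these patterns, so "don" would be reported as a three-letter word. Whether that matters depends on how a word is defined.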
Recall and Precision/Accuracy Evaluation
You have a system which classifies a document as being written in one of four languages: English, Dutch, French or Spanish (E, D, F, or S respectively). The table below gives the number of documents classified into each language by the system, broken down by the actual language the documents are written in.
| sys \ act | E | D | F | S |
| E | 50 | 30 | 15 | 5 |
| D | 10 | 25 | 3 | 2 |
| F | 16 | 4 | 48 | 12 |
| S | 10 | 5 | 15 | 20 |
-
What are the recall and precision rates for each language?
-
The languages can be divided into two groups, Germanic (English and Dutch) and Romance (French and Spanish).
- What are the accuracy rates for the groups Germanic and Romance?
- Can accuracy rates for a group be lower than for the individual components before combining?
- Can they drop below the average of the combined rates? (For example, can the Germanic accuracy rate drop below the average of English and Dutch accuracy rates?)
- Can they be equal to the average of the combined rates?
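Per-language precision and recall, and the effect of merging languages into groups, can be explored with a sketch like the one below. It assumes that rows of the table are the system's output and columns are the actual language, as the "sys" and "act" labels suggest. The names CONFUSION, precision_recall and merge are invented here.

```python
LANGS = ["E", "D", "F", "S"]

# ASSUMED orientation: rows = language assigned by the system,
# columns = actual language of the document.
CONFUSION = [
    [50, 30, 15, 5],
    [10, 25, 3, 2],
    [16, 4, 48, 12],
    [10, 5, 15, 20],
]


def precision_recall(matrix, labels):
    """Return {label: (precision, recall)} from a square confusion matrix."""
    result = {}
    for i, label in enumerate(labels):
        assigned = sum(matrix[i])               # row total: system said label
        actual = sum(row[i] for row in matrix)  # column total: truly label
        result[label] = (matrix[i][i] / assigned, matrix[i][i] / actual)
    return result


def merge(matrix, groups):
    """Collapse classes into groups; groups is a list of lists of indices."""
    return [[sum(matrix[i][j] for i in rows for j in cols) for cols in groups]
            for rows in groups]


print(precision_recall(CONFUSION, LANGS))

# Germanic = English + Dutch, Romance = French + Spanish.
grouped = merge(CONFUSION, [[0, 1], [2, 3]])
print(grouped)
print(precision_recall(grouped, ["Germanic", "Romance"]))
```

Merging turns confusions within a group (for example English labelled as Dutch) into correct decisions, which helps in reasoning about how group rates relate to the individual ones.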
Mark Dras or