COMP348 Document Processing and the Semantic Web
Tutorial Week 3
Tokenisation and Sentence Segmentation (I)
Here is a solution to last week's practical exercise on counting tokens:
    import sys


    def count_tokens(file, counts):
        """Count the tokens occurring in the file, which contains
        one token per line, and add them to the existing counts in the
        counts dictionary.
        Returns: a dictionary containing the tokens and their counts."""
        with open(file) as infile:
            for line in infile:
                word = line.rstrip().lower()
                counts[word] = counts.get(word, 0) + 1
        return counts


    def report_tokens(counts):
        """Display a report of the tokens in a frequency count."""
        # use a lambda form to order the tokens by descending frequency
        keys = sorted(counts, key=lambda word: counts[word], reverse=True)
        # print out just the top 20 words (fewer if there are fewer tokens)
        for word in keys[:20]:
            print(word + ":\t" + str(counts[word]))


    if __name__ == '__main__':
        counts = dict()
        for arg in sys.argv[1:]:
            print(arg)
            count_tokens(arg, counts)
        report_tokens(counts)
Evaluation Measures
Conditional Probability
Bayes Rule
What is the probability from the table below that someone is blond given that they have brown eyes?
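As a reminder of the calculation behind a question like this one, the conditional probability is P(A|B) = P(A and B) / P(B), and Bayes' rule gives the same value as P(B|A) P(A) / P(B). The sketch below checks that the two routes agree on a small contingency table. The counts are invented purely for illustration and are not the figures from the tutorial's table; the function names are likewise assumptions.

```python
# Hypothetical hair/eye counts: made-up numbers, NOT the tutorial's table.
counts = {
    ("blond", "blue"): 30,
    ("blond", "brown"): 10,
    ("dark", "blue"): 20,
    ("dark", "brown"): 40,
}
total = sum(counts.values())


def p_joint(hair, eyes):
    """P(hair and eyes): one cell of the table divided by the grand total."""
    return counts[(hair, eyes)] / total


def p_hair(hair):
    """Marginal P(hair): sum over all eye colours."""
    return sum(n for (h, _), n in counts.items() if h == hair) / total


def p_eyes(eyes):
    """Marginal P(eyes): sum over all hair colours."""
    return sum(n for (_, e), n in counts.items() if e == eyes) / total


def p_hair_given_eyes(hair, eyes):
    """P(hair | eyes) straight from the definition."""
    return p_joint(hair, eyes) / p_eyes(eyes)


def p_eyes_given_hair(eyes, hair):
    """P(eyes | hair) straight from the definition."""
    return p_joint(hair, eyes) / p_hair(hair)


# Route 1: the definition of conditional probability.
direct = p_hair_given_eyes("blond", "brown")
# Route 2: Bayes' rule, P(blond | brown) = P(brown | blond) P(blond) / P(brown).
via_bayes = p_eyes_given_hair("brown", "blond") * p_hair("blond") / p_eyes("brown")
print(direct, via_bayes)
```

With these invented counts both routes come to 10/50. The same functions can be applied to the real table by replacing the `counts` dictionary.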
Collocations
A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. Examples of collocations are strong tea (as opposed to powerful tea, even though strong and powerful are interchangeable in other contexts), World Wide Web, and red wine. Many specialised terms are in fact collocations, and therefore specialised terminology can be detected by comparing the relative frequency of a sequence of words across several domains.
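One possible way to sketch the cross-domain comparison is to count word pairs (bigrams) in two token lists, turn the counts into relative frequencies, and rank the pairs by how much more frequent they are in one domain than in the other. The function names, the toy sentences and the `floor` value used for unseen pairs are all assumptions made for this illustration, and a real experiment would need far larger corpora.

```python
from collections import Counter


def bigram_rel_freq(tokens):
    """Map each adjacent word pair to its relative frequency in tokens."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    n = len(pairs)
    return {pair: c / n for pair, c in pair_counts.items()}


def domain_ratio(target_tokens, reference_tokens, floor=1e-6):
    """Rank target-domain bigrams by target frequency / reference frequency.

    floor is an arbitrary stand-in frequency for bigrams that never occur
    in the reference domain (an assumption, to avoid dividing by zero).
    """
    target = bigram_rel_freq(target_tokens)
    reference = bigram_rel_freq(reference_tokens)
    scored = {pair: f / reference.get(pair, floor) for pair, f in target.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Toy "domains": two invented token lists, far too small for real use.
web = "the world wide web is a web of documents on the world wide web".split()
general = "the world is wide and the tea is strong on the table".split()

ranking = domain_ratio(web, general)
for pair, ratio in ranking[:2]:
    print(pair, round(ratio))
```

On this toy data the two highest-ranked pairs are the two halves of World Wide Web, because they recur in the first list and never appear in the second.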
Tokenisation and Sentence Segmentation (II)
Consider the following text:
It is 0.025-in. long. A. lives in the U.S. John
Mackenzie Jr. lives in Dallas, Tex. This is a
fact. At 3.p.m. Continental finalized its
offer. Complaints should be sent to
Dr. White. He stopped at Meadows Dr. White
Falcon was still open. This happened at 3 p.m. Did Conti-
nental finalize its offer? "There is such a quantity
of unknown and instructive documents" -- H. A.
Taine, August 1875. The cost is $95.40 per average field
trip; John, pay attention! How is infection transmitted?
It is not transmitted from: giving blood/mosquito
bites/toilet seats/kissing/from normal day-to-day contact.
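To see why a text like this is hard, it can help to start from a deliberately naive baseline. The sketch below splits wherever a full stop, question mark or exclamation mark is followed by whitespace and then an upper-case letter or an opening quote. This is not a proposed solution; the pattern and the name `naive_sentences` are assumptions chosen only to expose the problem cases.

```python
import re

# Whitespace that follows ., ? or ! and precedes an upper-case letter or a
# double quote is treated as a sentence boundary (a naive assumption).
BOUNDARY = re.compile(r'(?<=[.?!])\s+(?=[A-Z"])')


def naive_sentences(text):
    """Split text into candidate sentences using the BOUNDARY pattern."""
    # Collapse line breaks first so a sentence wrapped over two lines stays whole.
    flat = " ".join(text.split())
    return BOUNDARY.split(flat)


sample = ("Complaints should be sent to Dr. White. He stopped at "
          "Meadows Dr. White Falcon was still open.")
for sentence in naive_sentences(sample):
    print(repr(sentence))
```

On this fragment the baseline produces four pieces instead of three sentences: it cuts after both occurrences of Dr., although only the second one (Meadows Dr.) actually ends a sentence. Telling those two cases apart needs more than punctuation and capitalisation.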
Morphology
Morphology in Dutch for nouns has some similarities with English.
How would you write rules for these for the Porter Stemmer, using the notation of the lecture notes? (Source of information about Dutch grammar of plural nouns: http://www.dutchgrammar.com.)
Mark Dras or Diego Molla
