COMP348 Document Processing and the Semantic Web
Tutorial Week 9
Word Sense Disambiguation
The following exercise is based on part of a question from a previous
exam. The sentences below are the only sentences in a small corpus
that contain the word chip. They are grouped according to the sense
of chip they contain:
- Sense 1
- The CPU or the Central Processing Unit is the brain of
the computer and the single most important chip in the
computer.
- As previously reported, the new Pentium III chip falls
into the SpeedStep family of Intel processors.
- This design, called Slot 1, will connect the chip to the
computer through a slot instead of the traditional pin
structure.
- Sense 2
- It's out there somewhere -- the ultimate chocolate chip
cookie, perfectly baked, perfectly formed, heaped with divine
chocolate pieces.
- The recipe was for something called Beatles, which were
potato chip cookies with chocolate topping.
- For each occurrence of the word chip, write down the words that
appear in a 7-word context window.
- Write down the context of each occurrence of the word chip as a
5-element vector, where each element indicates whether the
corresponding word occurs in the context (1) or not (0). The words
are, in order:
computer, pentium, chocolate, potato, cookie.
- If the vector is called v, what are the values of
P(s1), P(s2), P(v[0] = 1 | s1), and P(v[3] = 1 | s2)?
- What is the most likely sense of chip in the sentence "The extra
Pentium chip enables cookies to be accessed particularly quickly",
according to the Naive Bayes approach? What does your calculation
tell you about using straight Naive Bayes?
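The steps in the questions above can be checked with a short script. The following is a minimal sketch, not a model answer, and it rests on several assumptions that are not stated in the exercise: the 7-word window is taken as the target word plus 3 tokens on each side; a feature word also matches its plural form (so "cookies" counts as "cookie"); the classifier is a Bernoulli Naive Bayes that multiplies P(v[j] = x | s) over all five features; and the optional `alpha` parameter applies add-alpha smoothing. All function names are hypothetical.

```python
import re

FEATURES = ["computer", "pentium", "chocolate", "potato", "cookie"]

CORPUS = {
    "s1": [
        "The CPU or the Central Processing Unit is the brain of the computer "
        "and the single most important chip in the computer.",
        "As previously reported, the new Pentium III chip falls into the "
        "SpeedStep family of Intel processors.",
        "This design, called Slot 1, will connect the chip to the computer "
        "through a slot instead of the traditional pin structure.",
    ],
    "s2": [
        "It's out there somewhere -- the ultimate chocolate chip cookie, "
        "perfectly baked, perfectly formed, heaped with divine chocolate pieces.",
        "The recipe was for something called Beatles, which were potato chip "
        "cookies with chocolate topping.",
    ],
}


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def context(text, target="chip", half=3):
    """Assumption: a 7-word window is the target plus 3 tokens on each side."""
    tokens = tokenize(text)
    i = tokens.index(target)
    return tokens[max(0, i - half):i] + tokens[i + 1:i + 1 + half]


def vectorize(words):
    """Binary vector over FEATURES; assumption: a plural 's' also matches."""
    return [int(any(w in (f, f + "s") for w in words)) for f in FEATURES]


def train(corpus):
    """For each sense, store (prior, per-feature counts of 1s, number of examples)."""
    total = sum(len(sentences) for sentences in corpus.values())
    model = {}
    for sense, sentences in corpus.items():
        vectors = [vectorize(context(s)) for s in sentences]
        counts = [sum(column) for column in zip(*vectors)]
        model[sense] = (len(vectors) / total, counts, len(vectors))
    return model


def score(model, v, alpha=0.0):
    """P(s) * prod_j P(v[j] | s); alpha=0 gives the unsmoothed estimate."""
    result = {}
    for sense, (prior, counts, n) in model.items():
        p = prior
        for x, c in zip(v, counts):
            p1 = (c + alpha) / (n + 2 * alpha)  # estimate of P(v[j] = 1 | s)
            p *= p1 if x else 1 - p1
        result[sense] = p
    return result


model = train(CORPUS)
test_vector = vectorize(context(
    "The extra Pentium chip enables cookies to be accessed particularly quickly"))
print(test_vector)
print(score(model, test_vector))             # unsmoothed
print(score(model, test_vector, alpha=1.0))  # add-one smoothing
```

With `alpha=0` the sketch multiplies raw relative frequencies, so a single feature value never observed with a sense sets that sense's whole score to zero. Comparing the two printed score dictionaries is one way to explore the last question.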
Information Retrieval
Given the following documents:
- D1
- Dating back to the 1950s, Machine translation is the oldest
language technology application.
- D2
- Programming languages are used to program machines to do their tasks.
- D3
- Human languages are notoriously difficult to process by machines.
- Write the inverted index for the words machine, technology,
program, human, and language, and run the Boolean query (machine OR
human) AND NOT program. For this exercise, ignore word inflections,
so that the words "machine" and "machines" match the same keyword.
- Build the document vectors of the three documents using tf.idf
term weighting.
- Apply cosine similarity to determine which document is most
relevant to the query machine language. What problems did you
encounter, and how could you solve them?
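The three tasks above can be sketched in a few lines of Python. This is only one possible reading of the exercise, under assumptions that the handout does not fix: inflections are ignored with a crude prefix match (a token matches a keyword if it starts with it, so "Programming" counts towards program); tf is the raw count; idf is the natural logarithm of N / df; and cosine similarity is reported as `None` when either vector has zero length. All names are hypothetical.

```python
import math
import re

DOCS = {
    "D1": "Dating back to the 1950s, Machine translation is the oldest "
          "language technology application.",
    "D2": "Programming languages are used to program machines to do their tasks.",
    "D3": "Human languages are notoriously difficult to process by machines.",
}
KEYWORDS = ["machine", "technology", "program", "human", "language"]


def term_counts(text):
    """Raw counts per keyword; assumption: prefix match stands in for stemming."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [sum(t.startswith(k) for t in tokens) for k in KEYWORDS]


COUNTS = {doc: term_counts(text) for doc, text in DOCS.items()}

# Inverted index: keyword -> set of documents that contain it.
INDEX = {k: {doc for doc, c in COUNTS.items() if c[j] > 0}
         for j, k in enumerate(KEYWORDS)}

# (machine OR human) AND NOT program, expressed with set operations.
boolean_result = (INDEX["machine"] | INDEX["human"]) - INDEX["program"]

# Assumption: idf = ln(N / df). Other tf.idf variants give other weights.
N = len(DOCS)
IDF = [math.log(N / len(INDEX[k])) for k in KEYWORDS]


def tfidf(counts):
    """Weight raw term counts by the idf of each keyword."""
    return [c * w for c, w in zip(counts, IDF)]


def cosine(a, b):
    """Cosine similarity, or None when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else None


query = tfidf(term_counts("machine language"))
SIMS = {doc: cosine(tfidf(c), query) for doc, c in COUNTS.items()}

print(INDEX)
print(sorted(boolean_result))
print({doc: tfidf(c) for doc, c in COUNTS.items()})
print(SIMS)
```

Under this idf definition, a keyword that occurs in every document gets weight ln(1) = 0. Inspecting `IDF` and `SIMS` is a useful starting point for the final question.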
Mark Dras or