Department of Computing

COMP348 Document Processing and the Semantic Web

Tutorial Week 9

Word Sense Disambiguation

The following exercise is based on part of a question from a previous exam. The sentences below are the only sentences in a small corpus that contain the word chip. They are grouped according to the sense of chip they contain:

Sense 1
  1. The CPU or the Central Processing Unit is the brain of the computer and the single most important chip in the computer.
  2. As previously reported, the new Pentium III chip falls into the SpeedStep family of Intel processors.
  3. This design, called Slot 1, will connect the chip to the computer through a slot instead of the traditional pin structure.
Sense 2
  1. It's out there somewhere -- the ultimate chocolate chip cookie, perfectly baked, perfectly formed, heaped with divine chocolate pieces.
  2. The recipe was for something called Beatles, which were potato chip cookies with chocolate topping.

  1. For each occurrence of the word chip, write down the words that appear in a 7-word context window.

  2. Write down the context of each occurrence of the word chip as a 5-element vector in which each element indicates whether or not the corresponding word below occurs in the context:

    computer, pentium, chocolate, potato, cookie.
  3. If the vector is called v, what are the values of P(s1), P(s2), P(v[0] = 1 | s1), and P(v[3] = 1 | s2)?

  4. According to the Naive Bayes approach, what is the most likely sense of chip in the sentence "The extra Pentium chip enables cookies to be accessed particularly quickly"? What does your calculation tell you about using straight Naive Bayes?
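
The steps in this exercise can be sketched in a few lines of Python. This is only one possible reading, not a model answer. It assumes that a 7-word window means the target word plus three words on either side, that a plural such as "cookies" counts as its keyword, and that each vector element is treated as a Bernoulli (present or absent) variable. The names FEATURES, TRAINING, context_vector, score and the alpha parameter are made up for this illustration.

```python
import re
from fractions import Fraction

# The five keywords listed in question 2, in vector order.
FEATURES = ["computer", "pentium", "chocolate", "potato", "cookie"]

# The corpus sentences, grouped by sense as in the tutorial.
TRAINING = {
    "s1": [
        "The CPU or the Central Processing Unit is the brain of the computer "
        "and the single most important chip in the computer.",
        "As previously reported, the new Pentium III chip falls into the "
        "SpeedStep family of Intel processors.",
        "This design, called Slot 1, will connect the chip to the computer "
        "through a slot instead of the traditional pin structure.",
    ],
    "s2": [
        "It's out there somewhere -- the ultimate chocolate chip cookie, "
        "perfectly baked, perfectly formed, heaped with divine chocolate pieces.",
        "The recipe was for something called Beatles, which were potato chip "
        "cookies with chocolate topping.",
    ],
}


def context_vector(sentence, target="chip", size=3):
    """Binary vector over FEATURES for the words around the first `target`.

    Assumption: `size` words on each side; a trailing "s" is tolerated so
    that "cookies" matches "cookie".
    """
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    i = words.index(target)
    window = words[max(0, i - size):i] + words[i + 1:i + 1 + size]
    return [int(f in window or f + "s" in window) for f in FEATURES]


def score(vector, sense, alpha=0):
    """P(sense) times the product of P(v[j] | sense) over all elements.

    alpha=0 gives plain relative frequencies; alpha=1 gives add-one
    smoothing of each Bernoulli estimate (an assumed variant).
    """
    vectors = [context_vector(s) for s in TRAINING[sense]]
    total = sum(len(sentences) for sentences in TRAINING.values())
    result = Fraction(len(vectors), total)  # prior P(sense)
    for j, value in enumerate(vector):
        ones = sum(v[j] for v in vectors)
        p_one = Fraction(ones + alpha, len(vectors) + 2 * alpha)
        result *= p_one if value else 1 - p_one
    return result


test_vector = context_vector(
    "The extra Pentium chip enables cookies to be accessed particularly quickly"
)
for sense in TRAINING:
    print(sense, score(test_vector, sense), score(test_vector, sense, alpha=1))
```

Under these assumptions the test sentence maps to [0, 1, 0, 0, 1], and both unsmoothed scores come out as zero because each sense has at least one keyword it never co-occurred with. Passing alpha=1 is one common way to avoid zero estimates; other smoothing schemes would give different numbers.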

Information Retrieval

Given the following documents:

D1
Dating back to the 1950s, Machine translation is the oldest language technology application.
D2
Programming languages are used to program machines to do their tasks.
D3
Human languages are notoriously difficult to process by machines.
  1. Write the inverted index for the words machine, technology, program, human, and language, and run the boolean query (machine OR human) AND NOT program. For this exercise, ignore word inflections, so the words "machine" and "machines" match the same keyword.
  2. Build the document vectors of the three documents using tf.idf term weighting.
  3. Apply cosine similarity to determine which document is most relevant to the query machine language. What problems did you encounter, and how could you solve them?
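
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference solution: inflections are ignored by crude prefix matching (so "Programming" counts as program), tf is the raw count, and idf is log10(N/df). The smooth option, which uses log10(1 + N/df) instead, is an assumed variant included only to show one way of avoiding zero weights. All names (DOCS, KEYWORDS, term_counts, INDEX, rank) are hypothetical.

```python
import math
import re

DOCS = {
    "D1": "Dating back to the 1950s, Machine translation is the oldest "
          "language technology application.",
    "D2": "Programming languages are used to program machines to do their tasks.",
    "D3": "Human languages are notoriously difficult to process by machines.",
}
KEYWORDS = ["machine", "technology", "program", "human", "language"]


def term_counts(text):
    """Count tokens per keyword; prefix matching stands in for stemming."""
    words = re.findall(r"[a-z]+", text.lower())
    return {k: sum(w.startswith(k) for w in words) for k in KEYWORDS}


COUNTS = {name: term_counts(text) for name, text in DOCS.items()}

# Inverted index: keyword -> set of documents that contain it.
INDEX = {k: {d for d in DOCS if COUNTS[d][k] > 0} for k in KEYWORDS}

# (machine OR human) AND NOT program, as set operations on the postings.
boolean_hits = (INDEX["machine"] | INDEX["human"]) - INDEX["program"]


def idf(term, smooth=False):
    ratio = len(DOCS) / len(INDEX[term])
    return math.log10(1 + ratio) if smooth else math.log10(ratio)


def tfidf(counts, smooth=False):
    return [counts[k] * idf(k, smooth) for k in KEYWORDS]


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; report 0.0 instead of dividing by zero.
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def rank(query, smooth=False):
    """Cosine similarity between the query vector and each document vector."""
    q = tfidf(term_counts(query), smooth)
    return {d: cosine(tfidf(COUNTS[d], smooth), q) for d in DOCS}


print(sorted(boolean_hits))
print(rank("machine language"))
print(rank("machine language", smooth=True))
```

With the plain idf, any keyword that occurs in every document receives weight zero, so a query made up only of such keywords becomes a zero vector and its cosine is undefined; the sketch reports 0.0 in that case. Whether the smoothed variant is an acceptable fix is part of what question 3 asks you to discuss.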

Comments to: Mark Dras or Diego Molla

Copyright Macquarie University