Solutions week 9
================

Word Sense Disambiguation
-------------------------

1. Context window:

   - single most important chip in the computer
   - new Pentium III chip falls into the
   - will connect the chip to the computer
   - the ultimate chocolate chip cookie perfectly baked
   - which were potato chip cookies with chocolate

2. Vectors:

   - (1, 1, 0, 0, 0)
   - (0, 1, 0, 0, 0)
   - (1, 0, 0, 0, 0)
   - (0, 0, 1, 0, 1)
   - (0, 0, 1, 1, 1)

3. P(s1) = 3/5, P(s2) = 2/5, P(v[0]=1|s1) = 2/3, P(v[3]=1|s2) = 1/2

4. sense = argmax_s P(s) P(v|s)

   New data: (0, 1, 0, 0, 1)

   - s1: 3/5 * 1/3 * 2/3 * 1 * 1 * 0 = 0
   - s2: 2/5 * 1 * 0 * 0 * 1/2 * 1 = 0

   Both scores are zero, so some smoothing is needed to avoid zero
   factors. With smoothing, the first sense would be selected: s1 has
   only one zero factor, whereas s2 has two.

Information Retrieval
---------------------

1. machine    = {D1,D2,D3}
   technology = {D1}
   program    = {D2}
   human      = {D3}
   language   = {D1,D2,D3}

   (machine OR human) AND NOT program
     = ({D1,D2,D3} UNION {D3}) - {D2}
     = {D1,D2,D3} - {D2}
     = {D1,D3}

2. Term order: (machine, technology, program, human, language)

   D1 = (1*log(3/3), 1*log(3/1), 0,          0,          1*log(3/3))
   D2 = (1*log(3/3), 0,          1*log(3/1), 0,          1*log(3/3))
   D3 = (1*log(3/3), 0,          0,          1*log(3/1), 1*log(3/3))

3. Since "machine" and "language" occur in every document, their tf.idf
   weights are log(3/3) = 0, so no documents are retrieved. To solve
   this, consider smoothing all 0 values to some small values. For
   example, the IDF can change from log(N/n) to log((N+1)/n).

   Then the query is:

   Q  = (1*log(4/3), 0,          0,          0,          1*log(4/3))

   And the new D1 to D3:

   D1 = (1*log(4/3), 1*log(4/1), 0,          0,          1*log(4/3))
   D2 = (1*log(4/3), 0,          1*log(4/1), 0,          1*log(4/3))
   D3 = (1*log(4/3), 0,          0,          1*log(4/1), 1*log(4/3))

   The scores are then computed by applying the scoring formula to Q
   and each of the new document vectors.
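For Information Retrieval, step 3, the effect of the smoothed IDF can be checked with a short Python sketch. The solution does not say which scoring formula is meant, so cosine similarity is used here as an assumption. The names `DOCS`, `QUERY`, `idf`, `vector`, `cosine` and `scores` are illustrative only, and every term frequency is taken to be 1, as in the vectors above.

```python
import math

# Term order used in the solution vectors.
TERMS = ["machine", "technology", "program", "human", "language"]

# Terms contained in each document (every term frequency is 1).
DOCS = {
    "D1": {"machine", "technology", "language"},
    "D2": {"machine", "program", "language"},
    "D3": {"machine", "human", "language"},
}

# Query terms, read off the query vector Q in step 3.
QUERY = {"machine", "language"}


def idf(term, smooth):
    """log(N/n), or log((N+1)/n) when smoothing is switched on."""
    n = sum(term in words for words in DOCS.values())
    N = len(DOCS)
    return math.log((N + 1) / n) if smooth else math.log(N / n)


def vector(words, smooth):
    """tf.idf vector over TERMS with tf = 1 for every term present."""
    return [idf(t, smooth) if t in words else 0.0 for t in TERMS]


def cosine(a, b):
    """Cosine similarity (an assumed scoring formula); 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scores(smooth):
    q = vector(QUERY, smooth)
    return {name: cosine(q, vector(words, smooth)) for name, words in DOCS.items()}


if __name__ == "__main__":
    print("unsmoothed:", scores(False))
    print("smoothed:  ", scores(True))
```

Under these assumptions the unsmoothed scores are all 0, and the smoothed scores are positive but identical for D1, D2 and D3: each document matches the query on the same two terms and carries exactly one extra term with the same weight.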
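For Word Sense Disambiguation, step 4, the claim that smoothing selects the first sense can be verified with a small Python sketch. The solution only asks for "some smoothing"; add-one (Laplace) smoothing of the feature probabilities is assumed here, and the priors are left unsmoothed. The names `TRAIN`, `score` and `classify` are illustrative.

```python
from fractions import Fraction

# Training vectors from step 2, grouped by sense.
TRAIN = {
    "s1": [(1, 1, 0, 0, 0), (0, 1, 0, 0, 0), (1, 0, 0, 0, 0)],
    "s2": [(0, 0, 1, 0, 1), (0, 0, 1, 1, 1)],
}


def score(sense, v, alpha):
    """P(s) * prod_i P(v[i] | s), with add-alpha smoothing on each factor.

    alpha = 0 reproduces the unsmoothed estimates of steps 3 and 4.
    Features are binary, hence the 2 * alpha in the denominator.
    """
    vectors = TRAIN[sense]
    total = sum(len(vs) for vs in TRAIN.values())
    p = Fraction(len(vectors), total)  # prior P(s), not smoothed
    for i, x in enumerate(v):
        count = sum(1 for t in vectors if t[i] == x)
        p *= Fraction(count + alpha, len(vectors) + 2 * alpha)
    return p


def classify(v, alpha=1):
    """Return the sense with the highest score."""
    return max(TRAIN, key=lambda s: score(s, v, alpha))


if __name__ == "__main__":
    new = (0, 1, 0, 0, 1)
    for a in (0, 1):
        print("alpha =", a, {s: score(s, new, a) for s in TRAIN})
    print("selected:", classify(new))
```

With `alpha = 0` both scores are 0, as in step 4. With `alpha = 1` the scores are 288/15625 for s1 and 9/1280 for s2, so s1 is selected.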