COMP348 Document Processing and the Semantic Web
Practical, Week 9
Information Retrieval
Attempt at least questions 1 to 3
- The following program creates an inverted index of the NLTK
Gutenberg corpus and saves it into a file "index.pickle". Use this
index to find the documents that match the following boolean
queries. For this you need to use Python's sets:
  - Brutus OR Caesar
  - Brutus AND NOT Caesar
  - (Brutus AND Caesar) OR Calpurnia
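
As a hint for how the three queries map onto set operations, here is a
minimal sketch. It assumes the index is a dictionary from lowercased
words to sets of document IDs; the real contents of "index.pickle" may
be organised differently, and the document IDs below are made up. The
helper name postings is hypothetical.

```python
# Toy inverted index: word -> set of document IDs (made-up data).
# With the real file you would load it first, for example:
#     import pickle
#     with open("index.pickle", "rb") as f:
#         index = pickle.load(f)
index = {
    "brutus": {"doc1", "doc2", "doc4"},
    "caesar": {"doc1", "doc2", "doc3"},
    "calpurnia": {"doc3", "doc5"},
}

def postings(word):
    """Return the set of documents containing word (empty set if unseen)."""
    return index.get(word.lower(), set())

brutus = postings("Brutus")
caesar = postings("Caesar")
calpurnia = postings("Calpurnia")

q1 = brutus | caesar                # Brutus OR Caesar: set union
q2 = brutus - caesar                # Brutus AND NOT Caesar: set difference
q3 = (brutus & caesar) | calpurnia  # (Brutus AND Caesar) OR Calpurnia

print(sorted(q1), sorted(q2), sorted(q3))
```

Set difference handles AND NOT directly, so there is no need to build
the set of all documents that lack "Caesar".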
- If you haven't done so in a previous practical exercise, write a
function that computes the word frequency of all words in a given
document. You will need this for the next exercises.
    def computeWordFrequency(docID):
        """Find the frequencies of all words in the document

        Return a dictionary with the words as keys and the frequencies as the
        values of the dictionary
        """
- Write a function that computes the tf.idf of all words in all
documents. Use the NLTK Gutenberg corpus for this.
    def computeTFIDF():
        """Return a dictionary with the TFIDF of each word in each document

        Each entry of this dictionary has the following keys and values:
        - Key: the document ID
        - Value: another dictionary with words for keys and the TFIDF for
          values

        A special dictionary entry stores the IDF of all words:
        - Key: 'idf'
        - Value: another dictionary with words for keys and the IDF for values
        """
- Now write a program that prints the words of a document with the
highest tf.idf scores, ranked in descending order. Can you see any
correlation between those words and the actual content words
(i.e. non-stop words) of the documents?
    def returnHighestTFIDF(document, tfidf, topN=20):
        "Return the topN words with highest TFIDF score in the given document"
- Finally, write a function that takes a string as a parameter and
returns the ID of the document that is most relevant according to
the tf.idf measure and the cosine similarity.
    def retrieve(query, tfidf):
        "Retrieve the most relevant document"
Mark Dras or