
Department of Computing

COMP348 Document Processing and the Semantic Web

Practical, Week 9

Information Retrieval

Attempt at least questions 1 to 3
  1. The following program creates an inverted index of the NLTK Gutenberg corpus and saves it into a file "index.pickle". Use this index to find the documents that match the following boolean queries. For this you need to use Python's sets:
    1. Brutus OR Caesar
    2. Brutus AND NOT Caesar
    3. (Brutus AND Caesar) OR Calpurnia
  2. If you haven't done so in a previous practical exercise, write a function that computes the word frequency of all words in a given document. You will need this for the next exercises.
    def computeWordFrequency(docID):
        """Find the frequencies of all words in the document.

        Return a dictionary with the words as keys and the frequencies as
        the values.
        """
    
  3. Write a function that computes the tf.idf score of every word in every document. Use the NLTK Gutenberg corpus for this.
    def computeTFIDF():
        """Return a dictionary with the TFIDF of each word in each document.

        Each entry of this dictionary has the following keys and values:
          - Key: the document ID
          - Value: another dictionary with words for keys and the TFIDF
            for values
        A special dictionary entry stores the IDF of all words:
          - Key: 'idf'
          - Value: another dictionary with words for keys and the IDF for
            values
        """
    
  4. Now write a program that prints the words of a document with the highest tf.idf scores, ranked in descending order. Can you see any correlation between those words and the actual content words (i.e. non-stop words) of the documents?
    def returnHighestTFIDF(document, tfidf, topN=20):
        "Return the topN words with highest TFIDF score in the given document"
    
  5. Finally, write a function that takes a query string as a parameter and returns the ID of the document most relevant to it, ranking documents by the cosine similarity of their tf.idf vectors.
    def retrieve(query, tfidf):
        "Retrieve the most relevant document"
    
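Hint for question 1: the boolean operators map directly onto Python's set operations. OR is union (|), AND is intersection (&), and AND NOT is set difference (-). The sketch below assumes the index is a dictionary that maps each word to a set of document IDs. The toy index, its file names and the helper postings() are invented for illustration and do not reflect the real contents of "index.pickle".

```python
# Hypothetical toy index: word -> set of document IDs.
# This is NOT the real index.pickle, only an illustration of its assumed shape.
index = {
    "Brutus": {"doc1.txt", "doc2.txt", "doc4.txt"},
    "Caesar": {"doc1.txt", "doc2.txt", "doc3.txt"},
    "Calpurnia": {"doc2.txt", "doc5.txt"},
}


def postings(word):
    """Return the set of documents containing word (empty set if absent)."""
    return index.get(word, set())


# Brutus OR Caesar: union of the two posting sets
brutus_or_caesar = postings("Brutus") | postings("Caesar")

# Brutus AND NOT Caesar: set difference
brutus_and_not_caesar = postings("Brutus") - postings("Caesar")

# (Brutus AND Caesar) OR Calpurnia: intersection first, then union
combined = (postings("Brutus") & postings("Caesar")) | postings("Calpurnia")

print(sorted(brutus_or_caesar))
print(sorted(brutus_and_not_caesar))
print(sorted(combined))
```

With the real index, the dictionary would presumably be loaded with pickle.load() from "index.pickle". How that file stores words (for example, lowercased or not) depends on the indexing program, so check its keys before querying.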

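Hint for question 2: counting word frequencies is a one-liner with collections.Counter. The sketch below works on a plain list of tokens so that it runs without the corpus. In the practical, the tokens would come from the corpus reader, for example nltk.corpus.gutenberg.words(docID). The name word_frequency and the decision to lowercase every token are assumptions made for this example.

```python
from collections import Counter


def word_frequency(tokens):
    """Return a dict mapping each lowercased token to its number of occurrences."""
    # Lowercasing is an assumption; drop it to keep "The" and "the" distinct.
    return dict(Counter(token.lower() for token in tokens))


# Hypothetical toy document standing in for a Gutenberg text
tokens = ["The", "cat", "saw", "the", "other", "cat"]
freq = word_frequency(tokens)
print(freq)
```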
 
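Hint for questions 3 and 4: the sketch below builds the dictionary layout described in question 3 and then ranks the words of one document by score. It assumes one common variant of the measure, raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. The practical does not fix a formula, so treat this choice, the function names and the toy documents as assumptions.

```python
import math
from collections import Counter


def compute_tfidf(docs):
    """docs maps a document ID to its list of tokens.

    Returns {doc_id: {word: tfidf}} plus a special 'idf' entry {word: idf}.
    """
    n_docs = len(docs)
    # Term frequency: raw counts per document
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # Document frequency: number of documents in which each word occurs
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    idf = {word: math.log(n_docs / df[word]) for word in df}
    tfidf = {
        doc_id: {word: count * idf[word] for word, count in counts.items()}
        for doc_id, counts in tf.items()
    }
    tfidf["idf"] = idf  # would clash with a document whose ID is 'idf'
    return tfidf


def highest_tfidf(doc_id, tfidf, top_n=20):
    """Return the top_n words of doc_id, highest score first."""
    scores = tfidf[doc_id]
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical toy corpus
docs = {
    "d1": ["caesar", "the", "brutus", "the"],
    "d2": ["the", "calpurnia", "calpurnia"],
    "d3": ["the", "caesar"],
}
tfidf = compute_tfidf(docs)
print(highest_tfidf("d1", tfidf, top_n=2))
```

Note that "the" occurs in every toy document, so its IDF is log(3/3) = 0 and it drops to the bottom of every ranking. This is the effect question 4 asks you to look for with stop words.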

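Hint for question 5: one possible approach is to turn the query into a tf.idf vector using the stored IDF values, then return the document whose vector has the highest cosine similarity with it. The sketch below assumes the dictionary layout from question 3. The toy scores, the whitespace tokenisation and lowercasing of the query, and the choice to return None when nothing matches are all assumptions for this example.

```python
import math
from collections import Counter


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def retrieve(query, tfidf):
    """Return the ID of the best-matching document, or None if none matches."""
    idf = tfidf["idf"]
    # Assumption: split on whitespace and lowercase the query words
    counts = Counter(query.lower().split())
    query_vec = {w: c * idf[w] for w, c in counts.items() if w in idf}
    best_doc, best_score = None, 0.0
    for doc_id, doc_vec in tfidf.items():
        if doc_id == "idf":  # skip the special IDF entry
            continue
        score = cosine(query_vec, doc_vec)
        if score > best_score:
            best_doc, best_score = doc_id, score
    return best_doc


# Hypothetical toy scores in the layout described in question 3
tfidf = {
    "d1": {"brutus": 1.1, "caesar": 0.4},
    "d2": {"calpurnia": 2.2},
    "d3": {"caesar": 0.4},
    "idf": {"brutus": 1.1, "caesar": 0.4, "calpurnia": 1.1, "the": 0.0},
}
print(retrieve("Brutus", tfidf))
print(retrieve("caesar", tfidf))
```

Because cosine similarity normalises by vector length, the query "caesar" prefers d3, which is only about Caesar, over d1, which mentions Caesar alongside a more heavily weighted word.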

Comments to: Mark Dras or Diego Molla

Copyright Macquarie University
CRICOS provider no. 00002J