Department of Computing

COMP348 Document Processing and the Semantic Web

Tutorial Week 11

Document Summarisation

Here is a sample text (Wikipedia, 20 May 2008):

Karen Spärck Jones FBA (26 August 1935 - 4 April 2007) was a British computer scientist.

Karen Spärck Jones was born in Huddersfield, Yorkshire, England. Her father was Owen Jones, a lecturer in chemistry, and her mother was Ida Spärck, a Norwegian who moved to Britain during World War II. Spärck Jones was educated at a grammar school and then Girton College, Cambridge from 1953 to 1956, reading History. Initially she became a school teacher.

She worked at Cambridge's Computer Laboratory from 1974, and retired in 2002, holding the post of Professor of Computers and Information. She continued to work in the Computer Laboratory until shortly before her death. Her main research interests, since the late 1950s, were natural language processing and information retrieval. One of her most important contributions was the concept of inverse document frequency (IDF) weighting in information retrieval, which she introduced in a 1972 paper. IDF is used in most search engines today, usually as part of the tf-idf weighting scheme.

Prof. Spärck Jones was a Fellow of the British Academy, of which she was Vice-President in 2000-02. She was also a Fellow of both the AAAI and the ECCAI and was President of the Association for Computational Linguistics in 1994. She received several awards for her research including the Gerard Salton Award (1988), the ASIS&T Award of Merit (2002), the ACL Lifetime Achievement Award (2004), the BCS Lovelace Medal (2007) and the ACM-AAAI Allen Newell Award (2007).

She was married to fellow Cambridge computer scientist Roger Needham until his death in 2003. She died at Willingham in Cambridgeshire.

  1. Extract the most important sentences according to the following criteria:
    1. Frequency-keyword. For your reference, here are the most frequent tokens (words and punctuation marks) with their counts, in descending order of frequency. Choose your keywords from among them.

      15 .            2 scientist      2 retrieval
      14 ,            2 AAAI
      14 the          2 death
      11 was          2 which
      11 in           2 school
      10 of           2 computer
       8 and          2 research
       8 a            2 most
       7 (            2 Karen
       6 She          2 Laboratory
       5 Jones        2 Her
       5 Spärck       2 IDF
       5 -            2 from
       4 Award        2 until
       4 to           2 weighting
       4 her          2 Fellow
       3 she          2 information
       3 Cambridge    2 Computer
       3 ),           2 President
       3 at           2 British
       3 )            2 for
       3 2007         2 2002

    2. Location
    3. Cue and indicator phrases; which ones would you select?
  2. Compare the summaries that you have obtained and discuss how you could improve them.
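
As a starting point for comparing the three criteria, here is a minimal Python sketch that scores sentences by frequency keywords, by location, and by cue phrases. It is only one possible reading of each criterion: the stop list, the cue phrases, the 1/position location weight, and all function names are assumptions made for illustration, and only four sentences of the sample text are used.

```python
import re
from collections import Counter

# Four sentences taken from the sample text above.
SENTENCES = [
    "Karen Spärck Jones FBA (26 August 1935 - 4 April 2007) was a British computer scientist.",
    "Her main research interests, since the late 1950s, were natural language processing and information retrieval.",
    "One of her most important contributions was the concept of inverse document frequency (IDF) weighting in information retrieval, which she introduced in a 1972 paper.",
    "IDF is used in most search engines today, usually as part of the tf-idf weighting scheme.",
]

# Assumption: a tiny hand-picked stop list; a real system would use a fuller one.
STOP_WORDS = {"the", "was", "in", "of", "and", "a", "to", "her", "she", "at",
              "which", "from", "until", "for", "most"}

# Assumption: illustrative cue phrases, not a list prescribed by the tutorial.
CUE_PHRASES = ("most important", "main")


def tokenize(text):
    """Return lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def keyword_scores(sentences, top_n):
    """Count, per sentence, the tokens that belong to the top_n frequent content words."""
    counts = Counter(t for s in sentences for t in tokenize(s) if t not in STOP_WORDS)
    keywords = {word for word, _ in counts.most_common(top_n)}
    return [sum(1 for t in tokenize(s) if t in keywords) for s in sentences]


def location_scores(sentences):
    """Favour early sentences: the sentence at position p (1-based) scores 1/p."""
    return [1.0 / (i + 1) for i in range(len(sentences))]


def cue_scores(sentences):
    """Count how many cue phrases occur in each sentence (plain substring match)."""
    return [sum(1 for cue in CUE_PHRASES if cue in s.lower()) for s in sentences]


def best_index(scores):
    """Index of the highest-scoring sentence; ties go to the earliest one."""
    return max(range(len(scores)), key=lambda i: scores[i])


if __name__ == "__main__":
    for name, scores in [
        ("keyword", keyword_scores(SENTENCES, top_n=4)),
        ("location", location_scores(SENTENCES)),
        ("cue", cue_scores(SENTENCES)),
    ]:
        print(name, scores, "->", SENTENCES[best_index(scores)])
```

On these four sentences the keyword criterion and the location criterion pick different sentences, which is the kind of disagreement the comparison exercise asks you to discuss. A combined summariser might weight and add the three scores; the weights would be a further design choice.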

Information Extraction

  1. Come up with a scenario where you would apply information extraction techniques.
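
To make the idea concrete, here is one hypothetical scenario sketched in Python: filling an (award, year) template from the awards sentence of the sample biography. The pattern and the function name `extract_awards` are assumptions made for illustration; a hand-written pattern like this is brittle and only works because the sentence follows a regular "the NAME (YEAR)" layout.

```python
import re

# The awards sentence from the sample biography.
AWARDS_SENTENCE = (
    "She received several awards for her research including the Gerard Salton "
    "Award (1988), the ASIS&T Award of Merit (2002), the ACL Lifetime "
    "Achievement Award (2004), the BCS Lovelace Medal (2007) and the ACM-AAAI "
    "Allen Newell Award (2007)."
)

# Assumption: an award is a capitalised phrase introduced by "the" and
# followed by a four-digit year in parentheses.
AWARD_PATTERN = re.compile(r"\bthe ([A-Z][^()]*?) \((\d{4})\)")


def extract_awards(text):
    """Return a list of (award name, year) pairs found in the text."""
    return [(name, int(year)) for name, year in AWARD_PATTERN.findall(text)]


if __name__ == "__main__":
    for award, year in extract_awards(AWARDS_SENTENCE):
        print(year, award)
```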

Named Entity Recognition

Given the same biographical text about Karen Spärck Jones:

  1. Identify all the main entities according to the MUC entity types.
  2. Identify the instances that would be easiest to detect automatically and explain what is needed to detect them.
  3. Identify the instances that would be hardest to detect automatically and explain why.
  4. Annotate the words in a manner that makes them suitable for a statistical classifier.
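
One common way to prepare such annotations for a statistical classifier is a per-token BIO encoding (B = beginning of an entity, I = inside, O = outside). The sketch below shows it on one sentence of the sample text. The tokenisation, the entity spans, and the function name `to_bio` are assumptions made for illustration, not a model answer; other encodings are possible.

```python
def to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans is a list of (start, end, label) over token indices, end exclusive.
    Spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


# Assumption: a simple tokenisation that splits off punctuation.
TOKENS = ["Karen", "Spärck", "Jones", "was", "born", "in",
          "Huddersfield", ",", "Yorkshire", ",", "England", "."]

# Assumption: hand-annotated spans using MUC-style labels.
SPANS = [(0, 3, "PERSON"), (6, 7, "LOCATION"),
         (8, 9, "LOCATION"), (10, 11, "LOCATION")]

if __name__ == "__main__":
    for token, tag in zip(TOKENS, to_bio(TOKENS, SPANS)):
        print(f"{token}\t{tag}")
```

Each (token, tag) pair then becomes one training instance; the classifier's features (capitalisation, neighbouring words, and so on) are a separate design choice.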

Comments to: Mark Dras or Diego Molla

Copyright Macquarie University
CRICOS provider no. 00002J