
Department of Computing


COMP348 Document Processing and the Semantic Web

Practical, Week 11

Information Extraction and Chunking

In this week's practical session you will experiment with a chunker and study how a chunker can be used in the information extraction process.

A chunker is a tool that is designed to identify simple syntactic structures such as noun phrases and verb phrases given a tagged text as input.

In contrast to a parser, a chunker does not try to construct deeply nested syntactic structures. The main task of a chunker is to group contiguous sequences of tokens into chunks. These chunks are helpful for information extraction tasks and for describing and checking the syntactic environment of words.
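
The idea of grouping contiguous tagged tokens into flat chunks can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the NLTK chunker: the example sentence, the tag pattern NP_PATTERN and the function chunk_np are all hypothetical names chosen for this sketch.

```python
import re

# Tagged input: a list of (word, part-of-speech tag) pairs.
tagged = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
          ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# Assumed tag pattern for a simple noun phrase:
# an optional determiner, any number of adjectives, one or more nouns.
NP_PATTERN = re.compile(r"(<DT>)?(<JJ>)*(<NN[A-Z]*>)+")

def chunk_np(tagged_tokens):
    """Group contiguous tokens whose tags match NP_PATTERN into flat NP chunks."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN><VBD>...".
    tags = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    result = []
    pos = 0  # index of the next token that has not been emitted yet
    for match in NP_PATTERN.finditer(tags):
        # Counting "<" converts character offsets back to token indices.
        start = tags.count("<", 0, match.start())
        end = tags.count("<", 0, match.end())
        result.extend(tagged_tokens[pos:start])            # tokens outside any chunk
        result.append(("NP", tagged_tokens[start:end]))    # one flat chunk
        pos = end
    result.extend(tagged_tokens[pos:])
    return result

for item in chunk_np(tagged):
    print(item)
```

Note that the result is only one level deep: each chunk is a flat group of tokens, and no chunk contains another chunk. A parser, by contrast, would try to build a nested structure over the whole sentence.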

For this week's practical session, please read Chapter 7 (Chunking) of the NLTK documentation.

Your Task

  • In the practical session, please work through Sections 7.1-7.4.1 of the NLTK documentation and try out all the examples discussed there using Python.

    Before you start experimenting with these examples, make sure that

    • the numerical Python module NumPy is installed on your machine;
    • the CoNLL 2000 corpus, which contains 270k words of Wall Street Journal text with IOB tags, is available at 'corpora/conll2000/train.txt' in your NLTK installation.

  • If you have enough time, try to modify the following input sentence (discussed in Section 7.4.1 of the NLTK documentation)
    
      text = '''
        he PRP B-NP
        accepted VBD B-VP
        the DT B-NP
        position NN I-NP
        of IN B-PP
        vice NN B-NP
        chairman NN I-NP
        of IN B-PP
        Carlyle NNP B-NP
        Group NNP I-NP
        , , O
        a DT B-NP
        merchant NN I-NP
        banking NN I-NP
        concern NN I-NP'''
    
    
    and generate different NP chunks in tree format using the conversion function chunk.conllstr2tree().
  • Try to extract different types of chunked text from the CoNLL 2000 corpus (Section 7.4.1 gives an example of how to do this).
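
To see what the IOB tags in the input sentence encode before you convert it with NLTK, the following stdlib-only sketch may help. It is not the implementation of chunk.conllstr2tree(); the functions iob_to_chunks and chunk_texts and the chunk_types parameter are hypothetical names for this illustration, and the output is a flat list rather than an NLTK tree. A B- tag starts a chunk, an I- tag continues the current chunk, and O marks a token outside any chunk.

```python
text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP'''

def iob_to_chunks(iob_text, chunk_types=("NP", "VP", "PP")):
    """Turn 'word tag IOB' lines into a flat list of chunks and loose tokens."""
    result = []
    current = None  # the (chunk_type, tokens) pair currently being extended
    for line in iob_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, tag, iob = line.split()
        state, _, ctype = iob.partition("-")  # "B-NP" -> ("B", "-", "NP")
        if ctype not in chunk_types:
            state = "O"  # chunk types we did not ask for count as outside
        if state == "I" and current is not None and current[0] == ctype:
            current[1].append((word, tag))       # continue the open chunk
        elif state in ("B", "I"):
            current = (ctype, [(word, tag)])     # start a new chunk
            result.append(current)
        else:
            current = None
            result.append((word, tag))           # token outside any chunk
    return result

def chunk_texts(items, chunk_type):
    """Return the words of every chunk of the given type as strings."""
    return [" ".join(word for word, _ in tokens)
            for label, tokens in items
            if isinstance(tokens, list) and label == chunk_type]

print(chunk_texts(iob_to_chunks(text, chunk_types=("NP",)), "NP"))
print(chunk_texts(iob_to_chunks(text), "PP"))
```

Changing the IOB column of the input (for example, turning a B-NP into an I-NP) or the chunk_types argument changes which chunks come out, which is the same kind of experiment the task above asks you to carry out with the NLTK function.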

After this practical session you should know:

  • what a chunker is and what text format a chunker takes as input;
  • what the difference is between a chunker and a parser;
  • what tag patterns are;
  • how you can do chunking with regular expressions;
  • how chunking can be used for various information extraction tasks.

If you encounter any problems, please ask your practical supervisor for help.

 


Comments to: Mark Dras or Diego Molla


Copyright Macquarie University
CRICOS provider no. 00002J