COMP348 Document Processing and the Semantic Web
Practical, Week 11
Information Extraction and Chunking
In this week's practical session you will experiment with a chunker
and study how a chunker can be used in the information extraction
process.
A chunker is a tool that is designed to identify simple syntactic
structures such as noun phrases and verb phrases given a tagged text
as input.
In contrast to a parser, a chunker does not try to construct deeply
nested syntactic structures. The main task of a chunker is to group
contiguous sequences of tokens into chunks. These chunks are helpful
for information extraction tasks and for describing and checking the
syntactic environment of words.
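To make the grouping idea concrete, here is a minimal plain-Python sketch. It is not how NLTK implements chunking: the set `NP_TAGS` and the function `group_noun_phrases` are assumptions made up for this illustration, and the rule is deliberately crude (two adjacent noun phrases would be merged into one chunk). It only shows that a chunker takes tagged tokens as input and returns flat, non-nested groups.

```python
from itertools import groupby

# Toy assumption: these POS tags count as "inside a noun phrase".
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}

def group_noun_phrases(tagged):
    """Group each maximal run of NP-like tags into one flat ("NP", tokens) chunk."""
    result = []
    for is_np, run in groupby(tagged, key=lambda pair: pair[1] in NP_TAGS):
        tokens = list(run)
        if is_np:
            result.append(("NP", tokens))   # one chunk, no nesting inside it
        else:
            result.extend(tokens)           # tokens outside any chunk stay as they are
    return result

tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
          ("at", "IN"), ("the", "DT"), ("cat", "NN")]
chunks = group_noun_phrases(tagged)
for item in chunks:
    print(item)
```

The result contains two NP chunks with the verb and the preposition left outside them. A parser would instead try to build a nested structure over the whole sentence.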
For this week's practical session, please read Chapter 7
(Chunking) of the NLTK documentation.
Your Task
- In the practical session, please work through Sections 7.1-7.4.1
of the NLTK documentation and try out all examples
discussed there using Python.
Before you start experimenting with these examples, make sure that
- the numerical Python module NumPy is installed on your machine;
- the CoNLL 2000 corpus, which contains 270k words of Wall Street
Journal text with IOB tags, is available under the path
'corpora/conll2000/train.txt' of your NLTK installation.
- If you have enough time, try to modify the following input
sentence (discussed in Section 7.4.1 of the NLTK documentation)
text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP'''
and generate different NP chunks in tree format using the conversion
function chunk.conllstr2tree().
- Try to extract different types of chunked text from the CoNLL 2000
corpus (Section 7.4.1 gives an example that explains how you can do
this).
After this practical session you should know:
- what a chunker is and what text format a chunker takes as input;
- what the difference is between a chunker and a parser;
- what tag patterns are;
- how you can do chunking with regular expressions;
- how chunking can be used for various information extraction tasks.
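As a revision aid for tag patterns and regular-expression chunking, the sketch below applies an ordinary Python regular expression to a sentence's tag sequence. It is a simplified illustration and not NLTK's implementation: the function `chunk_with_tag_pattern` is hypothetical, and the pattern must be written in plain `re` syntax over whole `<TAG>` units. In NLTK, a comparable rule is written as the tag pattern `NP: {<DT>?<JJ>*<NN>}` and passed to `nltk.RegexpParser`.

```python
import re

def chunk_with_tag_pattern(tagged, pattern, label="NP"):
    """Chunk tagged tokens wherever `pattern` matches the encoded tag sequence."""
    # Encode the tags as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)

    # Map the character offset where each tag starts back to its token index.
    offset_to_index = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 2                  # +2 for the angle brackets
    offset_to_index[offset] = len(tagged)       # end-of-sentence sentinel

    result, next_token = [], 0
    for match in re.finditer(pattern, tag_string):
        if match.start() == match.end():
            continue                            # skip empty matches
        first = offset_to_index[match.start()]
        last = offset_to_index[match.end()]
        result.extend(tagged[next_token:first])          # tokens before the chunk
        result.append((label, tagged[first:last]))       # the chunk itself
        next_token = last
    result.extend(tagged[next_token:])                   # tokens after the last chunk
    return result

# Optional determiner, any number of adjectives, then one noun tag (NN, NNS, NNP, ...).
NP_PATTERN = r"(?:<DT>)?(?:<JJ>)*<NN[^>]*>"

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
chunked = chunk_with_tag_pattern(sentence, NP_PATTERN)
for item in chunked:
    print(item)
```

Because the pattern describes sequences of tags rather than words, the same rule finds both "the little yellow dog" and "the cat".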
If you encounter any problems, please ask your practical supervisor
for help.
Mark Dras or