
Department of Computing


COMP348 Document Processing and the Semantic Web

Practical, Week 3

Tokenisation and Sentence Segmentation

Note that this week's prac is extra-long, since there's no prac next week because of Easter. You can complete it in your own time.

Tokenisation

Following is the output from the help for the MS-DOS dir command (also available here):

Displays a list of files and subdirectories in a directory.

DIR [drive:][path][filename] [/A[[:]attributes]] [/B] [/C] [/D] [/L] [/N]
  [/O[[:]sortorder]] [/P] [/Q] [/S] [/T[[:]timefield]] [/W] [/X] [/4]

  [drive:][path][filename]
              Specifies drive, directory, and/or files to list.

  /A          Displays files with specified attributes.
  attributes   D  Directories                R  Read-only files
               H  Hidden files               A  Files ready for archiving
               S  System files               -  Prefix meaning not
  /B          Uses bare format (no heading information or summary).
  /C          Display the thousand separator in file sizes.  This is the
              default.  Use /-C to disable display of separator.
  /D          Same as wide but files are list sorted by column.
  /L          Uses lowercase.
  /N          New long list format where filenames are on the far right.
  /O          List by files in sorted order.
  sortorder    N  By name (alphabetic)       S  By size (smallest first)
               E  By extension (alphabetic)  D  By date/time (oldest first)
               G  Group directories first    -  Prefix to reverse order
  /P          Pauses after each screenful of information.
  /Q          Display the owner of the file.
  /S          Displays files in specified directory and all subdirectories.
  /T          Controls which time field displayed or used for sorting
  timefield   C  Creation
              A  Last Access
              W  Last Written
  /W          Uses wide list format.
  /X          This displays the short names generated for non-8dot3 file
              names.  The format is that of /N with the short name inserted
              before the long name. If no short name is present, blanks are
              displayed in its place.
  /4          Displays four-digit years

Switches may be preset in the DIRCMD environment variable.  Override
preset switches by prefixing any switch with - (hyphen)--for example, /-W.

You can get it by typing dir /? at the Windows Run prompt.

Write a Python function that creates and prints out a dictionary with options as the key (e.g. /W) and their descriptions as the entry for that key.

Things to note:

  • consider what to do about multi-line explanations;
  • options appearing within descriptions shouldn't become indices;
  • ignore any text not relating to options and their descriptions.

Your output should then look something like:

>>> 
/4 : Displays four-digit years
/A : Displays files with specified attributes. attributes D Directories R Read-only files H Hidden files 
A Files ready for archiving S System files - Prefix meaning not
/B : Uses bare format (no heading information or summary).
/C : Display the thousand separator in file sizes.  This is the default. Use /-C to disable display of 
separator.
/D : Same as wide but files are list sorted by column.
/L : Uses lowercase.
/N : New long list format where filenames are on the far right.
/O : List by files in sorted order. sortorder N By name (alphabetic) S By size (smallest first) E By extension 
(alphabetic) D By date/time (oldest first) G Group directories first - Prefix to reverse order
/P : Pauses after each screenful of information.
/Q : Display the owner of the file.
/S : Displays files in specified directory and all subdirectories.
/T : Controls which time field displayed or used for sorting timefield C Creation A Last Access W Last Written
/W : Uses wide list format.
/X : This displays the short names generated for non-8dot3 file names. The format is that of /N with the short 
name inserted before the long name. If no short name is present, blanks are displayed in its place.
>>> 
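One possible approach is sketched below; it is not the only way to do it. It assumes that an option line is indented and starts with a slash, that indented lines following it continue its description, and that a blank or unindented line ends the description. The names HELP_EXCERPT, OPTION_LINE, parse_options and print_options are made up for this sketch, and the help text is shortened to a few lines of the output above.

```python
import re

# Shortened excerpt of the dir help text, just enough to exercise the parser.
HELP_EXCERPT = """\
Displays a list of files and subdirectories in a directory.

DIR [drive:][path][filename] [/A[[:]attributes]] [/B] [/C] [/D] [/L] [/N]
  [/O[[:]sortorder]] [/P] [/Q] [/S] [/T[[:]timefield]] [/W] [/X] [/4]

  [drive:][path][filename]
              Specifies drive, directory, and/or files to list.

  /B          Uses bare format (no heading information or summary).
  /C          Display the thousand separator in file sizes.  This is the
              default.  Use /-C to disable display of separator.
  /W          Uses wide list format.
  /4          Displays four-digit years

Switches may be preset in the DIRCMD environment variable.  Override
preset switches by prefixing any switch with - (hyphen)--for example, /-W.
"""

# Assumption: an option line is indented and its first non-blank
# character is a slash, e.g. "  /W          Uses wide list format."
OPTION_LINE = re.compile(r"^\s+(/\S+)\s+(\S.*)$")


def parse_options(help_text):
    """Return a dictionary mapping each option to its description."""
    options = {}
    current = None  # option whose description is being collected
    for line in help_text.splitlines():
        match = OPTION_LINE.match(line)
        if match:
            current = match.group(1)
            options[current] = [match.group(2)]
        elif current is not None and line[:1].isspace() and line.strip():
            # Indented, non-blank line: continuation of the description.
            options[current].append(line.strip())
        else:
            # Blank or unindented line: not part of any description.
            current = None
    # Join multi-line descriptions and collapse runs of whitespace.
    return {opt: " ".join(" ".join(parts).split())
            for opt, parts in options.items()}


def print_options(help_text):
    """Print the options in sorted order, one per line."""
    for option, description in sorted(parse_options(help_text).items()):
        print(option, ":", description)


print_options(HELP_EXCERPT)
```

Because the pattern is anchored at the start of the line, an option mentioned inside a description (such as /-C) never becomes a key.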

NLTK Tutorial

The Natural Language Toolkit provides a bunch of very useful algorithms and classes for use in text processing. In this session we'll introduce the modules by working through their tutorial on tokenisation. It's already installed in the E6A labs. If you're working from home and you don't have the toolkit installed, you can download it from the above site and install it relatively easily. Installation instructions for Windows are here.

NLTK comes with a tutorial-style book, which you can find at http://nltk.org/index.php/Book. For this practical session, you should do the following:

  1. Read Chapter 1: Introduction.

  2. Read Chapter 2: Programming Fundamentals and Python. (You'll know this all, but it's a good refresher, and there are probably a few useful details I haven't covered in lectures.)

    Try out some of the examples by running them yourself in Python.

  3. Read Chapter 3: Elementary Language Processing, up to and including Section 3 (Tokenization).

    1. Try out the code in Listing 3.1.
    2. Do Exercise 3.2.4 (2), about counting tokens in Persuasion.
    3. See Listing 3.5, also given in lecture notes, on generating text in a particular style. Add some randomness, so that the text doesn't get stuck in a loop. One possible way to do this is to use the existing deterministic method if a random number is below a threshold, say 0.9; otherwise, choose a random word as the next one. You'll want to use the random module for this.
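The randomness idea in step 3 can be sketched as follows. This is not Listing 3.5 itself: to keep the example self-contained, a plain dictionary of Counter objects stands in for the NLTK conditional frequency distribution, and the names build_model and generate, the sample text, and the rng parameter are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def build_model(words):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return model


def generate(model, vocabulary, word, num=15, threshold=0.9, rng=random):
    """Return num words, starting at word.

    If a random number is below threshold (and the current word has a
    known follower), take the most frequent follower, as the
    deterministic method does; otherwise pick a random word.
    """
    output = []
    for _ in range(num):
        output.append(word)
        followers = model.get(word)
        if followers and rng.random() < threshold:
            word = followers.most_common(1)[0][0]
        else:
            word = rng.choice(vocabulary)
    return output


text = "the cat sat on the mat and the cat ate the fish".split()
model = build_model(text)
vocabulary = sorted(set(text))
print(" ".join(generate(model, vocabulary, "the")))
```

Setting threshold to 1.0 gives back the purely deterministic behaviour, which is handy for checking that the loop problem really is caused by always choosing the most likely word.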

Conditional Probabilities

Write the code for count_pairs.py and bigram_cond_prob.py as discussed in the week 3 tutorial.
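The tutorial specification is not reproduced on this page, so the following is only a guess at the general shape: it assumes count_pairs counts adjacent word pairs and bigram_cond_prob estimates P(second | first) as the count of the pair divided by the number of pairs that start with the first word. The function signatures are assumptions, not the ones from the tutorial.

```python
from collections import Counter


def count_pairs(tokens):
    """Count each adjacent (first, second) pair of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def bigram_cond_prob(tokens, first, second):
    """Estimate P(second | first) from the pair counts."""
    pairs = count_pairs(tokens)
    # Number of pairs whose first element is `first`.
    first_count = sum(count for (w1, _), count in pairs.items() if w1 == first)
    if first_count == 0:
        return 0.0  # `first` never starts a pair, so nothing to estimate
    return pairs[(first, second)] / first_count


tokens = "the cat sat on the mat".split()
print(count_pairs(tokens))
print(bigram_cond_prob(tokens, "the", "cat"))
```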

NLTK Regexp Tokenizer

Use NLTK to rewrite your word frequency count program (count_tokens) from Practical Week 2. This time, you should use the raw data in this directory as input to the tokeniser.

Hints:

  1. You can construct a reader that reads in text files in the same way as the corpus examples in the lecture notes. Use:
    reader = nltk.corpus.PlaintextCorpusReader(".", "9405001.sent")
    corpus = reader.words()
    
  2. PlaintextCorpusReader can take a third argument (check the API), where you specify a tokeniser to use instead of the default. Try out the regexp_tokenizer, as in the lecture notes.

Sentence Segmentation

Rule-based Segmentation

Implement the following context rule to determine a sentence ending:

IF (right context = period + space + capital letter
	OR period + quote + space + capital letter
	OR period + space + quote + capital letter)
THEN sentence boundary

In order to implement the context rule, write a Python function sentence_end that takes two arguments, an index and a list of tokens. The function will return 1 if the token in the index position is an end of sentence, and 0 otherwise. Note that the rule must cover the case that the index is pointing to the last token in the list (in which case the function must return 1).

Using your function, write a simple sentence segmenter which takes a list of tokens and returns a list of sentence boundary indexes. Extend this to return a list of lists of tokens -- each corresponding to a sentence. E.g.:

>>> sentences(['This', 'is', 'one', '.', 'Named', '"', 'St.', 'James', '!', '"'])
[['This', 'is', 'one', '.'], ['Named', '"', 'St.', 'James', '!', '"']]
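A minimal sketch of one way to set this up follows. Since a token list no longer contains the spaces, it makes two simplifying assumptions: only the token '.' on its own counts as a period (so 'St.' is not a boundary), and a quote straight after a period is treated as an opening quote, because the closing-quote case cannot be told apart without whitespace information. The helper names starts_with_capital and boundaries are made up for the sketch.

```python
QUOTES = {'"', "'"}


def starts_with_capital(token):
    """True if the token's first character is an upper-case letter."""
    return token[:1].isupper()


def sentence_end(index, tokens):
    """Return 1 if tokens[index] ends a sentence, 0 otherwise."""
    if index == len(tokens) - 1:
        return 1  # the last token always ends a sentence
    if tokens[index] != ".":
        return 0  # only a stand-alone period can be a boundary
    following = tokens[index + 1]
    if starts_with_capital(following):
        return 1  # period + capital letter
    if (following in QUOTES and index + 2 < len(tokens)
            and starts_with_capital(tokens[index + 2])):
        return 1  # period + quote + capital letter
    return 0


def boundaries(tokens):
    """Return the indexes of all sentence-final tokens."""
    return [i for i in range(len(tokens)) if sentence_end(i, tokens)]


def sentences(tokens):
    """Split the token list into a list of sentences (lists of tokens)."""
    result, start = [], 0
    for end in boundaries(tokens):
        result.append(tokens[start:end + 1])
        start = end + 1
    return result


print(sentences(['This', 'is', 'one', '.', 'Named', '"', 'St.', 'James', '!', '"']))
```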

Probabilistic Segmentation

Using the NLTK framework, and in particular ConditionalFreqDist() as in the text generation example from the statistics lecture notes, write Python code that would calculate conditional probabilities to determine whether a period is an abbreviation or an end-of-sentence marker. You may assume that the training input for calculating your probabilities is a string with elements of the form word/symbol:

instr = "town/NN ./FS He/PRP went/VBD to/IN Victoria/NN \
St/NN ./ABBREVIATION where/ARB Dr/NN ./ABBREVIATION Smith/VBD ./FS"
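To show the kind of calculation involved, here is a rough sketch that conditions on the word before each period. It deliberately does not use NLTK: a dictionary of Counter objects stands in for ConditionalFreqDist so that the example runs on its own, and you would swap in the NLTK class for the exercise proper. The function names train and period_prob, and the choice of the preceding word as the condition, are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Same training string as above, split across two literals.
instr = ("town/NN ./FS He/PRP went/VBD to/IN Victoria/NN "
         "St/NN ./ABBREVIATION where/ARB Dr/NN ./ABBREVIATION Smith/VBD ./FS")


def train(tagged_text):
    """Count period tags, conditioned on the word before the period."""
    # Split each "word/symbol" element at its last slash.
    pairs = [item.rpartition("/")[::2] for item in tagged_text.split()]
    counts = defaultdict(Counter)
    for (prev_word, _), (word, tag) in zip(pairs, pairs[1:]):
        if word == ".":
            counts[prev_word][tag] += 1
    return counts


def period_prob(counts, prev_word, tag):
    """Estimate P(tag | previous word) for a period."""
    seen = counts.get(prev_word)
    if not seen:
        return 0.0  # previous word never seen before a period
    return seen[tag] / sum(seen.values())


counts = train(instr)
print(period_prob(counts, "St", "ABBREVIATION"))
print(period_prob(counts, "town", "FS"))
```

With so little training data every estimate is either 0 or 1; a larger tagged corpus would give intermediate values for words that appear before both kinds of period.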

Comments to: Mark Dras or Diego Molla

Copyright Macquarie University
CRICOS provider no. 00002J