COMP348 Document Processing and the Semantic Web

Practical, Week 3

Tokenisation and Sentence Segmentation

Note that this week's prac is extra-long, since there's no prac next week because of Easter. You can complete it in your own time.

Tokenisation
The following is the output from the help for the MS-DOS DIR command:
Displays a list of files and subdirectories in a directory.

DIR [drive:][path][filename] [/A[[:]attributes]] [/B] [/C] [/D] [/L] [/N]
  [/O[[:]sortorder]] [/P] [/Q] [/S] [/T[[:]timefield]] [/W] [/X] [/4]

  [drive:][path][filename]
              Specifies drive, directory, and/or files to list.

  /A          Displays files with specified attributes.
  attributes   D  Directories                R  Read-only files
               H  Hidden files               A  Files ready for archiving
               S  System files               -  Prefix meaning not
  /B          Uses bare format (no heading information or summary).
  /C          Display the thousand separator in file sizes. This is the
              default. Use /-C to disable display of separator.
  /D          Same as wide but files are list sorted by column.
  /L          Uses lowercase.
  /N          New long list format where filenames are on the far right.
  /O          List by files in sorted order.
  sortorder    N  By name (alphabetic)       S  By size (smallest first)
               E  By extension (alphabetic)  D  By date/time (oldest first)
               G  Group directories first    -  Prefix to reverse order
  /P          Pauses after each screenful of information.
  /Q          Display the owner of the file.
  /S          Displays files in specified directory and all subdirectories.
  /T          Controls which time field displayed or used for sorting
  timefield   C  Creation
              A  Last Access
              W  Last Written
  /W          Uses wide list format.
  /X          This displays the short names generated for non-8dot3 file
              names. The format is that of /N with the short name inserted
              before the long name. If no short name is present, blanks are
              displayed in its place.
  /4          Displays four-digit years

Switches may be preset in the DIRCMD environment variable. Override
preset switches by prefixing any switch with - (hyphen)--for example, /-W.
You can get it by typing
Write a Python function that creates and prints out a dictionary with
options as the key (e.g. Things to note:
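As a rough starting point, here is a minimal sketch of one way to build such a dictionary. It is not the only approach, and it rests on some assumptions: the help text is already held in a string, an option line starts with a switch such as /A after optional indentation, every other non-blank line continues the previous description, and a blank line ends a description. The names SAMPLE_HELP, OPTION_LINE, parse_options and print_options are made up for this illustration, and SAMPLE_HELP is a shortened stand-in for the full listing.

```python
import re

# Shortened, hypothetical sample of the help text (assumption: the real
# input would be the complete DIR help listing held in one string).
SAMPLE_HELP = """\
Displays a list of files and subdirectories in a directory.

DIR [drive:][path][filename] [/A[[:]attributes]] [/B] [/W] [/4]

  /A          Displays files with specified attributes.
  attributes   D  Directories                R  Read-only files
  /B          Uses bare format (no heading information or summary).
  /W          Uses wide list format.
  /4          Displays four-digit years

Switches may be preset in the DIRCMD environment variable.
"""

# An option line: optional indentation, a switch starting with "/",
# whitespace, then the first part of the description.
OPTION_LINE = re.compile(r"^\s*(/\S+)\s+(.*)$")


def parse_options(help_text):
    """Map each switch (e.g. '/A') to its description text."""
    options = {}
    current = None  # switch whose description is currently being collected
    for line in help_text.splitlines():
        match = OPTION_LINE.match(line)
        if match:
            current = match.group(1)
            options[current] = match.group(2).strip()
        elif not line.strip():
            current = None  # assumption: a blank line ends a description
        elif current is not None:
            # Continuation line: collapse runs of spaces and append.
            options[current] += " " + " ".join(line.split())
    return options


def print_options(options):
    """Print the options in sorted key order, one per line."""
    for switch in sorted(options):
        print(switch, ":", options[switch])


print_options(parse_options(SAMPLE_HELP))
```

Sorting the keys puts /4 before the lettered switches, because digits sort before uppercase letters in ASCII.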
Your output should then look something like:

>>>
/4 : Displays four-digit years
/A : Displays files with specified attributes. attributes D Directories R Read-only files H Hidden files A Files ready for archiving S System files - Prefix meaning not
/B : Uses bare format (no heading information or summary).
/C : Display the thousand separator in file sizes. This is the default. Use /-C to disable display of separator.
/D : Same as wide but files are list sorted by column.
/L : Uses lowercase.
/N : New long list format where filenames are on the far right.
/O : List by files in sorted order. sortorder N By name (alphabetic) S By size (smallest first) E By extension (alphabetic) D By date/time (oldest first) G Group directories first - Prefix to reverse order
/P : Pauses after each screenful of information.
/Q : Display the owner of the file.
/S : Displays files in specified directory and all subdirectories.
/T : Controls which time field displayed or used for sorting timefield C Creation A Last Access W Last Written
/W : Uses wide list format.
/X : This displays the short names generated for non-8dot3 file names. The format is that of /N with the short name inserted before the long name. If no short name is present, blanks are displayed in its place.
>>>

NLTK Tutorial

The Natural Language Toolkit provides a bunch of very useful algorithms and classes for use in text processing. In this session we'll introduce the modules by working through their tutorial on tokenisation. It's already installed in the E6A labs. If you're working from home and you don't have the toolkit installed, you can download it from the above site and install it relatively easily. Installation instructions for Windows are here. NLTK comes with a tutorial-style book, which you can find at http://nltk.org/index.php/Book.

For this practical session, you should do the following:
Conditional Probabilities
Write the code for

NLTK Regexp Tokenizer

Use NLTK to rewrite your word frequency count program (count_tokens) from Practical Week 2. This time, you should use the raw data in this directory as input to the tokeniser.

Hints:
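To get going, here is a minimal sketch of a word frequency count built on NLTK's RegexpTokenizer and FreqDist. The signature of count_tokens is an assumption (the Week 2 version may differ). The pattern \w+ is just one possible choice of token definition, the lowercasing step is optional, and the sample string is a made-up stand-in for the raw data files.

```python
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer


def count_tokens(text):
    """Return a frequency distribution over the word tokens in text.

    Assumed signature: takes the raw text as one string.
    """
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # is dropped rather than returned as separate tokens.
    tokenizer = RegexpTokenizer(r"\w+")
    tokens = [token.lower() for token in tokenizer.tokenize(text)]
    return FreqDist(tokens)


# Hypothetical sample text standing in for the raw data.
sample = "The cat sat on the mat. The mat was flat."
counts = count_tokens(sample)
for word, count in counts.most_common(3):
    print(word, count)
```

If you want punctuation kept as tokens, a pattern such as r"\w+|[^\w\s]+" is one alternative to try.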
Sentence Segmentation

Rule-based Segmentation

Implement the following context rule to determine a sentence ending:

IF (right context = period + space + capital letter
    OR period + quote + space + capital letter
    OR period + space + quote + capital letter)
THEN sentence boundary

In order to implement the context rule, write a Python function

Using your function, write a simple sentence segmenter which takes a list of tokens and returns a list of sentence boundary indexes. Extend this to return a list of lists of tokens, each corresponding to a sentence. E.g.:

>>> sentences(['This', 'is', 'one', '.', 'Named', '"', 'St.', 'James', '!', '"'])
[['This', 'is', 'one', '.'], ['Named', '"', 'St.', 'James', '!', '"']]

Probabilistic Segmentation
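For comparison with the probabilistic approach, here is a minimal token-level sketch of the rule-based segmenter from the previous section. It makes several assumptions. Whitespace is lost after tokenisation, so the two quote cases of the rule cannot be told apart, and the sketch arbitrarily places the boundary at the period, leaving the quote with the following sentence. An abbreviation such as 'St.' is assumed to arrive as a single token, so it never matches the period test. The helper names is_boundary and boundaries are made up for this illustration; only sentences comes from the example above.

```python
QUOTES = {'"', "'"}


def is_boundary(tokens, i):
    """Token-level approximation of the context rule for tokens[i]."""
    if tokens[i] != ".":
        return False
    right = tokens[i + 1:i + 3]  # up to two tokens of right context
    if right and right[0] in QUOTES:
        right = right[1:]  # skip one quote between the period and the word
    # Boundary only if the next word exists and starts with a capital letter.
    return bool(right) and right[0][:1].isupper()


def boundaries(tokens):
    """Return the indexes of the tokens that end a sentence."""
    return [i for i in range(len(tokens)) if is_boundary(tokens, i)]


def sentences(tokens):
    """Split tokens into a list of sentences, each a list of tokens."""
    result, start = [], 0
    for i in boundaries(tokens):
        result.append(tokens[start:i + 1])
        start = i + 1
    if start < len(tokens):
        result.append(tokens[start:])  # trailing tokens form the last sentence
    return result


tokens = ['This', 'is', 'one', '.', 'Named', '"', 'St.', 'James', '!', '"']
print(boundaries(tokens))
print(sentences(tokens))
```

Note that a period at the very end of the input has no right context, so it is not reported as a boundary; the remaining tokens are simply returned as the final sentence.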
Using the NLTK framework, and in particular ConditionalFreqDist()
as in the text generation example from the statistics lecture notes,
write Python code that would calculate conditional probabilities to
determine whether a period is an abbreviation or an end-of-sentence
marker. You may assume that the training input for
calculating your probabilities is a string with elements of the form
instr = "town/NN ./FS He/PRP went/VBD to/IN Victoria/NN \
St/NN ./ABBREVIATION where/ARB Dr/NN ./ABBREVIATION Smith/VBD ./FS"

Mark Dras or Diego Molla
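For the probabilistic segmentation exercise, a minimal sketch using ConditionalFreqDist might look like the following. The choice of conditioning on the word immediately before the period is an assumption made for illustration (other contexts, such as the following word, are equally possible), and the function name period_cfd is made up. The training string is the one given above, written with implicit string concatenation instead of a backslash.

```python
from nltk import ConditionalFreqDist

instr = ("town/NN ./FS He/PRP went/VBD to/IN Victoria/NN "
         "St/NN ./ABBREVIATION where/ARB Dr/NN ./ABBREVIATION Smith/VBD ./FS")


def period_cfd(tagged_text):
    """Count period tags conditioned on the word just before the period.

    Assumption: every whitespace-separated item has the form word/TAG.
    """
    pairs = [item.rsplit("/", 1) for item in tagged_text.split()]
    cfd = ConditionalFreqDist()
    # Walk over adjacent (previous, current) items.
    for (prev_word, _), (word, tag) in zip(pairs, pairs[1:]):
        if word == ".":
            cfd[prev_word][tag] += 1
    return cfd


cfd = period_cfd(instr)
for prev_word in sorted(cfd.conditions()):
    for tag in cfd[prev_word]:
        # freq() gives count(tag) / total count for this condition,
        # i.e. an estimate of P(tag | previous word).
        print(prev_word, tag, cfd[prev_word].freq(tag))
```

With so little training data every estimate is either 0 or 1; a larger tagged corpus would be needed for the probabilities to be informative.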
