Department of Computing


COMP249 Web Technology

Practical - Week 11

I've included a task below on parsing HTML. In addition, since you have your XML assignment to work on and your portfolio to finalise in the next few weeks, you might want to use some of this time to get advice from your prac tutor on either of them.

Parsing HTML

We have seen in the lectures how to subclass HTMLParser to create classes that parse HTML code and extract information from it. For example, the following piece of code can be used to extract headers (the h1 to h6 elements):

from HTMLParser import HTMLParser

class HeaderExtractor(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.context = ''
        self.text = ''
        self.headers = []

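    def handle_starttag(self, tag, attrs):
        # Entering a header: remember which one so handle_data collects its text.
        if tag in ['h1','h2','h3','h4','h5','h6']:
            self.context = tag

    def handle_endtag(self, tag):
        # Leaving a header: store "tag: text" and reset the state.
        if tag in ['h1','h2','h3','h4','h5','h6']:
            self.headers.append('%s: %s' % (tag, self.text))
            self.text = ''
            self.context = ''

    def handle_data(self, data):
        # Only text that occurs inside a header is collected.
        if self.context:
            self.text += data
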

if __name__ == "__main__":
    import urllib
    uri = raw_input("Give me a URI: ")
    u = urllib.urlopen(uri)
    html = u.read()
    u.close()
    h = HeaderExtractor()
    h.feed(html)
    print h.headers
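
To see what the parser passes to each handler, it can help to feed it a short literal string rather than a downloaded page. The following is a minimal sketch, not part of the lecture code: the class name EventLogger and the sample markup are made up for illustration, and the import is written so that it should work under both Python 2 and Python 3 (where the module is called html.parser).

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the example above


class EventLogger(HTMLParser):
    """Hypothetical helper: records every handler call in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(('start', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


logger = EventLogger()
logger.feed('<h1 class="title">Week <em>11</em></h1>')
logger.close()
for event in logger.events:
    print(event)
```

The trace shows that the text of the nested em element arrives in its own handle_data call while the h1 is still open, which is why HeaderExtractor accumulates text in self.text instead of storing a single chunk.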

Try out the HeaderExtractor example, and if something is unclear, please check the Python Documentation (13.1 HTMLParser -- Simple HTML and XHTML parser) or ask the practical supervisor. There are other Python classes that can be used for more ambitious parsing, but HTMLParser is easy to use and good enough for simple tasks. Below are three further tasks that you may try out:

  1. Write a class that finds all links specified in the "href" attribute of the <a> tag.
  2. Use the handle_data method to build a list of all of the words in the HTML file - refer back to your week 5 practical where you did word counts on plain text files.
  3. Extend this to build a simple inverted index that stores, for each word, the names of the files it appears in. Your script will need to read more than one HTML file to make this useful.
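
If you get stuck, the three tasks could be approached along the following lines. This is only a rough sketch under several assumptions, not a model answer: the names LinkExtractor, WordExtractor and build_index are invented here, a "word" is taken to be a run of letters, digits or apostrophes, and the pages are given as a small in-memory dictionary where a real script would read files from disk. It also makes no attempt to skip text inside script or style elements.

```python
import re
from collections import defaultdict

try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as in the example above


class LinkExtractor(HTMLParser):
    """Task 1 (sketch): collect the href value of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


class WordExtractor(HTMLParser):
    """Task 2 (sketch): collect the lower-cased words in the text content."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.words = []

    def handle_data(self, data):
        # Assumption: a word is a run of letters, digits or apostrophes.
        self.words += re.findall(r"[a-z0-9']+", data.lower())


def build_index(pages):
    """Task 3 (sketch): map each word to the set of page names containing it.

    pages is a hypothetical dict of {name: html_string}.
    """
    index = defaultdict(set)
    for name, html in pages.items():
        parser = WordExtractor()
        parser.feed(html)
        parser.close()
        for word in parser.words:
            index[word].add(name)
    return index


# Made-up sample pages, standing in for files read from disk.
pages = {
    'a.html': '<p>Parsing <a href="b.html">HTML</a> is fun</p>',
    'b.html': '<h1>HTML parsing</h1><p>See <a href="a.html">page A</a></p>',
}

links = LinkExtractor()
links.feed(pages['a.html'])
links.close()
print(links.links)

index = build_index(pages)
print(sorted(index['parsing']))
```

Storing a set of names per word means a file is recorded only once however often the word occurs in it. If you also want counts, as in the week 5 word-count practical, a dictionary of counts per file would be one alternative.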

Comments to: comp249-admin@ics.mq.edu.au
