<?xml version = "1.0"?>
<wines>
<wine grape = "chardonnay">
<product> Carneros </product>
<year> 2002 </year>
<price> 12.00 </price>
</wine>
<wine grape = "merlot">
...
</wine>
</wines>
File -> Tokeniser -> Parser -> Internal Form
An XML parser reads text with XML markup and delivers the XML content to an application program.
Parsers usually depend on the well-formedness of a document. Some can also check its validity (relative to a DTD or Schema).
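A minimal sketch of what depending on well-formedness means in practice: the parser refuses a document whose tags do not nest properly. The helper name is_well_formed and the two sample strings are made up for illustration. Validity against a DTD or Schema is not checked here.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        # The underlying expat parser reports malformed input this way
        return False


good = '<wine grape="merlot"><year>2002</year></wine>'
bad = '<wine grape="merlot"><year>2002</wine>'  # <year> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```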
There are two common models of interaction between the parser and your program:
SAX: Simple API for XML
DOM: the Document Object Model
Other ways of handling XML: XML data binding (mainly Java), etc.
SAX provides an event based interface to the parser.
User callbacks are associated with events for:
Start Tag
End Tag
Character data (text)
Etc.
SAX reads the XML document sequentially from start to finish, and along the way will invoke various callback methods when particular events occur.
A callback is a method registered with the parser, written to enable the code to respond to events of interest to the programmer.
SAX provides only sequential access to the XML document.
SAX is fast and requires very little memory.
This interface is good for large XML files where the processing required is linear.
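A small sketch of the event model, assuming a tiny in-memory document: the handler records each event as the parser reports it, so the sequential order becomes visible. The class name EventLogger is hypothetical.

```python
import xml.sax
from xml.sax import handler


class EventLogger(handler.ContentHandler):
    """Record each SAX event in the order the parser reports it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(('start', name))

    def endElement(self, name):
        self.events.append(('end', name))

    def characters(self, content):
        # Whitespace-only text is skipped to keep the trace short
        if content.strip():
            self.events.append(('text', content.strip()))


logger = EventLogger()
xml.sax.parseString(b'<wine><year>2002</year></wine>', logger)
for event in logger.events:
    print(event)
```

The handler never sees the document as a whole. Anything it needs later (here, the list of events) it has to store itself.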
Three kinds of objects: readers, handlers and input sources.
The reader (= parser) reads the characters from the input source, and produces a sequence of events.
The events get distributed to the handler objects (i.e. the parser invokes a method on the handler).
As the final step of preparation, the parser is called to parse the input.
During parsing, methods on the handler objects are called based on structural and syntactic events from the input data.
For details see Python Library Reference (8.9 xml.sax)
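A sketch that makes the three kinds of objects explicit, assuming the input comes from an in-memory byte stream rather than a file. The handler class CountElements and the sample document are made up for illustration.

```python
import io
from xml.sax import make_parser, handler
from xml.sax.xmlreader import InputSource


class CountElements(handler.ContentHandler):
    """Handler: counts every start-tag event it receives."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


# Input source: wraps the bytes the reader will consume
source = InputSource()
source.setByteStream(io.BytesIO(b'<collection><cd/><cd/></collection>'))

# Reader (= parser): turns the input into events for the handler
reader = make_parser()
counter = CountElements()
reader.setContentHandler(counter)
reader.parse(source)

print(counter.count)
```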
Find a CD with a given title attribute in the document cd.xml:
<?xml version='1.0'?>
<collection>
<cd artist="Canto Loco"
title="Los Locos Computadores">
<track title="Los Locos Computadores"
duration="3m22s"/>
<track title="Los Locos Hombres"
duration="2m15s"/>
</cd>
<cd artist="Canto Loco"
title="Los Todos Locos">
<track title="Los Todos Locos"
duration="3m22s"/>
</cd>
</collection>
import sys
from xml.sax import make_parser, handler
Define a content handler class:
class FindTitle(handler.ContentHandler):
    def __init__(self, title):
        handler.ContentHandler.__init__(self)
        self.search_title = title

    def startElement(self, name, attrs):
        """start element handler for finding titles"""
        # only do work for cd elements
        if name != 'cd':
            return
        # check the title attribute
        title = attrs.get('title', None)
        if title == self.search_title:
            print(title, 'found')
Main program:
if __name__ == '__main__':
    # Create a parser
    parser = make_parser()
    # Create the handler
    dh = FindTitle('Los Locos Computadores')
    # Tell the parser to use our handler
    parser.setContentHandler(dh)
    # Parse the command line input: saxparser.py cd.xml
    # parser.parse(sys.argv[1])
    # parse the XML document
    parser.parse('cd.xml')
The Document Object Model provides a tree based interface to an XML document.
The entire document is read and a parse tree is constructed in memory. Your program then accesses the parse tree.
Your program traverses the tree retrieving the items of interest.
DOM is good for smaller XML documents or when the tree model fits your data structures well.
from xml.dom import minidom

xmldoc = minidom.parse('cd.xml')
for node in xmldoc.getElementsByTagName('cd'):
    # look up the title attribute of each cd element
    attrobject = node.attributes['title']
    if attrobject.value == 'Los Locos Computadores':
        print(attrobject.value)
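A sketch of traversing the tree rather than just searching it, assuming the cd.xml content is supplied as a string so the example runs without the file. The function name tracks_by_cd is hypothetical.

```python
from xml.dom import minidom

CD_XML = """<collection>
  <cd artist="Canto Loco" title="Los Locos Computadores">
    <track title="Los Locos Computadores" duration="3m22s"/>
    <track title="Los Locos Hombres" duration="2m15s"/>
  </cd>
  <cd artist="Canto Loco" title="Los Todos Locos">
    <track title="Los Todos Locos" duration="3m22s"/>
  </cd>
</collection>"""


def tracks_by_cd(xml_text):
    """Map each cd title to the list of its track titles."""
    doc = minidom.parseString(xml_text)
    result = {}
    for cd in doc.getElementsByTagName('cd'):
        # childNodes also holds whitespace text nodes, so keep elements only
        tracks = [child.getAttribute('title')
                  for child in cd.childNodes
                  if child.nodeType == child.ELEMENT_NODE
                  and child.tagName == 'track']
        result[cd.getAttribute('title')] = tracks
    return result


for cd_title, tracks in tracks_by_cd(CD_XML).items():
    print(cd_title, tracks)
```

Because the whole tree is in memory, the program can move from a cd to its children (or back to its parent) in any order, which SAX does not allow.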
Writing XML is easier than reading it.
You don't need a toolkit, but one can make things easier (cf. writing HTML).
Can use the DOM: construct a tree in memory, then write.
Often libraries provide facilities to write data corresponding to some XML DTD.
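A sketch of the DOM route for writing, using xml.dom.minidom: build the tree in memory, then serialise it. The element names and values are taken from the wines example; whitespace inside the elements is left out for brevity.

```python
from xml.dom import minidom

# Build the tree in memory
doc = minidom.Document()
wines = doc.createElement('wines')
doc.appendChild(wines)

wine = doc.createElement('wine')
wine.setAttribute('grape', 'chardonnay')
wines.appendChild(wine)

for tag, value in [('product', 'Carneros'), ('year', '2002'), ('price', '12.00')]:
    child = doc.createElement(tag)
    child.appendChild(doc.createTextNode(value))
    wine.appendChild(child)

# Then write: serialise the root element to a string
xml_text = wines.toxml()
print(xml_text)
```

One reason to use a toolkit rather than plain string concatenation: it takes care of escaping characters such as < and & in text content.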
Select the first cd element that is the child of the collection element:
/collection/cd[1]
Select the last cd element that is the child of the collection element:
/collection/cd[last()]
Select the first four cd elements that are children of the collection element:
/collection/cd[position()<5]
Select all the artist elements that have an attribute named type with a value of 'group':
//artist[@type = 'group']
Select all the title elements of the cd elements of the collection element that have a price element with a value smaller than 20.00:
/collection/cd[price<20.00]/title
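A sketch of trying some of these selections from Python with xml.etree.ElementTree, which supports only a limited subset of XPath. Assumptions: the sample document is made up to fit the expressions, paths are written relative to the collection root element, and the price comparison is done in plain Python because numeric comparisons are outside that subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped to fit the expressions above
SAMPLE = """<collection>
  <cd><title>One</title><artist type="group">A</artist><price>25.00</price></cd>
  <cd><title>Two</title><artist type="solo">B</artist><price>15.00</price></cd>
  <cd><title>Three</title><artist type="group">C</artist><price>9.50</price></cd>
</collection>"""

root = ET.fromstring(SAMPLE)  # root is the collection element

first = root.findall('cd[1]')                      # like /collection/cd[1]
last = root.findall('cd[last()]')                  # like /collection/cd[last()]
groups = root.findall(".//artist[@type='group']")  # like //artist[@type = 'group']

# Like /collection/cd[price<20.00]/title, filtered in Python instead
cheap = [cd.findtext('title') for cd in root.findall('cd')
         if float(cd.findtext('price')) < 20.00]

print(first[0].findtext('title'))
print(last[0].findtext('title'))
print([artist.text for artist in groups])
print(cheap)
```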
Define the stylesheet with a processing instruction:
<?xml-stylesheet type="text/css" href="style.css"?>
Use the display property to define whether an element is a block or inline element.
<?xml version="1.0" ?>
<?xml-stylesheet type="text/css" href="sample6.css"?>
<THING> This is a thing with an inline stack:
<STACK>
<ROW>This is the <D>top</D> row.</ROW>
<ROW>This is the <D>middle</D> row.</ROW>
<ROW>This is the <D>bottom</D> row.</ROW>
</STACK>
which was displayed just there.
</THING>
View with an empty stylesheet, or with the stylesheet below:
THING { display: block; border: 1px solid red;}
STACK { display: inline-table; }
ROW { display: table-row; background: blue; color: white; }
D { display: inline; font-weight: bolder; color: red; }
A fundamental limitation of CSS is that it only specifies the appearance of what's there in the document.
CSS can't re-order parts of the document: no TOC, no index.
So, there's room for more XML presentation technologies: those which can create one XML document from one or more others (e.g. XSLT).
XHTML + CSS remains a powerful document delivery format for the web.
Transform name elements into HTML <b> (bold) elements:
<xsl:template match = "name"> <b><xsl:apply-templates/></b> </xsl:template>
<wines>
<wine grape = "chardonnay">
<product> Carneros </product>
<year> 2002 </year>
<price> 12.00 </price>
</wine>
</wines>
This example shows transformation of one XML dialect to another using an XSLT stylesheet.
<wines>
<wine grape = "chardonnay">
<product> Carneros </product>
<vintage> 2002 </vintage>
</wine>
</wines>
<xsl:stylesheet
xmlns:xsl = "http://www.w3.org/1999/XSL/Transform"
version = "1.0">
<xsl:template match = "year">
<vintage>
<xsl:apply-templates/>
</vintage>
</xsl:template>
<xsl:template match = "price"></xsl:template>
<!-- Copy all the other elements and attributes, and text nodes -->
<xsl:template match = "*|@*|text()">
<xsl:copy>
<xsl:apply-templates select = "*|@*|text()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
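For comparison, a sketch of the same transformation written by hand in Python with xml.etree.ElementTree. This is not XSLT and does not run the stylesheet above. It only mirrors its three rules: rename year to vintage, drop price, copy everything else. The function name transform is hypothetical.

```python
import xml.etree.ElementTree as ET

SOURCE = """<wines>
<wine grape="chardonnay">
<product> Carneros </product>
<year> 2002 </year>
<price> 12.00 </price>
</wine>
</wines>"""


def transform(xml_text):
    """Rename year to vintage and drop price, keeping everything else."""
    root = ET.fromstring(xml_text)
    for wine in root.iter('wine'):
        # Iterate over a copy, since children are removed along the way
        for child in list(wine):
            if child.tag == 'year':
                child.tag = 'vintage'
            elif child.tag == 'price':
                wine.remove(child)
    return root


result = transform(SOURCE)
print(ET.tostring(result, encoding='unicode'))
```

The stylesheet states the rules declaratively, one template per case. The Python version spells out the traversal explicitly.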
<?xml version = "1.0"?>
<document>
<title> COMP249: Web Technology </title>
<para> This is the main page of COMP249 where all information ... </para>
</document>
<html>
<head>
<meta http-equiv = "Content-Type" content = "text/html; charset = utf-8">
<title> COMP249: Web Technology </title>
</head>
<body>
<h1> COMP249: Web Technology </h1>
<p> This is the main page of COMP249
where all information ... </p>
</body>
</html>
<xsl:transform
xmlns:xsl = "http://www.w3.org/1999/XSL/Transform" version = "1.0">
<xsl:output method = "html"/>
<xsl:template match = "document">
<html><head>
<title>
<xsl:value-of select = "./title"/>
</title>
</head>
<body>
<xsl:apply-templates/>
</body>
</html>
</xsl:template>
<xsl:template match = "title">
<h1><xsl:apply-templates/></h1>
</xsl:template>
<xsl:template match = "para">
<p><xsl:apply-templates/></p>
</xsl:template>
</xsl:transform>
<xsl:stylesheet xmlns:xsl = "http://www.w3.org/1999/XSL/Transform"
version = "1.0">
...
</xsl:stylesheet>
xmlns:xsl = "http://www.w3.org/1999/XSL/Transform"
identifies the W3C XSL recommendation namespace.
<?xml version = "1.0"?>
<?xml-stylesheet type = "text/xsl" href = "comp249-style.xsl"?>
<document>
<title> COMP249: Web Technology </title>
<para> This is the main page of COMP249 where all information ... </para>
</document>
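Alternatively, run an XSLT processor from the command line, giving it the source document and the stylesheet (comp249-style.xsl):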
java -jar saxon.jar comp249-in.xml comp249-style.xsl > comp249-out.html
Other ways exist for transforming XML documents: