Abstract. This elucidative program documents the development of a simple XML parser for the
LAML software package.
This is a non-validating XML parser for LAML. As of the current version,
the parser is by no means complete. Never the less, it is useful tool for parsing most
everyday XML document to a Lisp data structure. Given a well-formed XML document this parser returns a Lisp tree structure that represents the parse tree of the XML document. This parser skips any element of the XML prolog (front matters) including the DTD (if any). The parser also skips XML comments. The parser handles start tags, end tags, and empty tags (in this parser called start-end tags). Entities and their declarations are not handled at all. The top level function is xml-parse and xml-parse-file. See also the brief reference manual of the XML parser. |
1 Introduction In this section we describe the overall parsing ideas together with examples. |
1.1 The start 1.2 Handling white space, comments, and front matters. 1.3 The parse tree 1.4 Parse tree functions |
1.1 The start |
We establish a parse stack, cf. parse-stack. This is plain and ordinary stack to which we make a good interface. Alternatively, we could have used a generic stack ADT.
The function parse-xml-ip starts the real work. After handling empty initial space and front matters (the XML prefix - see section 1.2) the ball is passed to parse-xml-balanced-expression which does the real job. More will follow in section 2.1.
1.2 Handling white space, comments, and front matters. |
The skipping of white space using skip-white-space is simple. The function skip-white-space-and-comments also skips XML comments. Notice the recursive implementation of skip-white-space-and-comments which ensures that all white space - including multiple comments - may be skipped.
1.3 The parse tree |
T = (tree N ST1 ST2 ... STn)STi, i = 1..n are trees recursively. tree is a symbol (tagging a syntax tree). A Leaf node is denoted as
(tree N)or just N if N is a string (corresponding to textual contents) or an empty tag (a tag without contents, also called a start-end tag in this software).
An inner node of a parse tree corresponds to a tag (an element) with contents. Such a node is represented by the following 'tag structure':
(tag kind tag-name . attr-info)Kind is either start, start-end, or end (corresponding to the three kinds of tags). Tag-name is a string. Attr-info is the attribute on property list format. An terminal may be a start-end node, or just a contents string. End tags are not represented in the parse tree.
(tag start "title" role "xxx" size "5")The function make-tag-structure constructs the nodes. The function kind-of-tag-structure returns the kind of a node (tag structure); whether start or start-end. There are also functions to access the constituents of a tag structure: tag-of-tag-structure and attributes-of-tag-structure.
If we assume that we parse the following XML fragment (If you are reading this in the source file, please consult the browser version):
<title role = "xxx" size = "5"> Here is some contents of title </title>The tree becomes:
((start "title" role "xxx" size "5") "Here is some contents of title")A more complicated example
<title role = "xxx" size = "5"> Here is some contents of title <mark number = "1" listed = "yes"/> More text. <section number = "1"> Section text. </section> </title>The tree is (slighly formatted, but accurated- actually parsed):
(tree (tag start "title" role "xxx" size "5") "Here is some contents of title " (tag start-end "mark" number "1" listed "yes") "More text. " (tree (tag start "section" number "1") "Section text. "))
1.4 Parse tree functions |
The predicate terminal-node? tells whether a tree is considered an terminal node. It is an terminal node if it is a contents string (direct on convoluted in a node), or if it is start-end-node (which cannot have contents). The complementing function function is called inner-node?. Both of these functions assume that they are applied on a tree.
We also have prediates on nodes: start-node? and start-end-node?.
2 The parser Here we explain the parser as such. |
2.1 Overall parsing functions 2.2 Parsing a balanced XML expression. 2.3 Parsing the contents after a start tag 2.4 Building the tree |
2.1 Overall parsing functions |
The function what-is-ahead looks ahead without really reading. All white space is consumed, so what-is-ahead faces a non-blank character. It is easy to spot if it is a tag, starting with start angle. If not, it is a string (contents).
The function read-tag is more serious. It parses a tag (start tag, end tag, or start-end tag) including the attributes. This function can read all kinds of tags, including an end tag. Internally, it is pretty straightforward. The first case to handle is the end tag (). In the conditional () we handle the following three cases: The parsing of the attributes in the start or start-end tag is done by the function read-tag-attributes. The rest of this last case has to do with the determination of whether we are facing a start tag or a start-end tag.
The function read-tag-attributes parses a list of attributes. The task of parsing a single attribute, like
attr = "value"is done by read-attribute-value-pair. Again, this is just detailed parsing work using the collect-skip library.
2.2 Parsing a balanced XML expression. |
The function assumes that all initial white space is consumed.
We start by finding out if we are facing a tag (start or start-end tag presumably) or textual contents ().
In case of a tag we handle the start, and start-end tag separately. The function read-tag reads the tag.
Let us first notice that the start-end tag is easy to handle, because it is terminal; So we just return the
node (the result returned by read-tag).
The most interesting and important case is the start tag (). We push the start tag onto the stack.
The function read-and-push-subtrees-until-end-tag pushes additional contributions onto the stack -
all read recursively by parse-xml-balanced-expression via read-and-push-subtrees-until-end-tag.
Finally the function build-tree-from-stack takes information from the stack (and pops the stack)
in order to build and return the parse tree. In the next section we will study read-and-push-subtrees-until-end-tag
and build-tree-from-stack.
If we encounter an end tag () there is an error which we report in the final case.
2.3 Parsing the contents after a start tag |
<title role = "xxx" size = "5"> Here is some contents of title <mark number = "1" listed = "yes"/> More text. <section number = "1"> Section text. </section> </title>we have just pushed information about the start tag on to the stack. read-and-push-subtrees-until-end-tag must read
We establish a look ahead corresponding to the length of end tag, including the "<", ">", and "\". This is the local name n (). If we encounter the end tag string () we just read it - and we are done. If not () we parse a subtree using parse-xml-balanced-expression, push it on to the stack, and iterates tail recursively via read-and-push-subtrees-until-end-tag.
2.4 Building the tree |
Recall that parse-xml-balanced-expression pushed the start tag on to the stack. Consequently, we pop the stack until we meet the tag corresponding to end-tag-name (the parameter of build-tree-from-stack.)
Given our example from section 2.3 the stack is (top on first line):
The section subtree "More text" mark start-end-tag "Here is some contents of title" start title tagThis is done by build-tree-from-stack-1, which is called in a trivial way by build-tree-from-stack. build-tree-from-stack-1 pops the stack iteratively until it meets the start tag (first parameter of build-tree-from-stack-1). The predicate start-tag-entry? identifies a start tag entry, and the function matches-stack-entry finds out if the top tag structure matches the start tag name. Notice that all stack entries are taged, such we can see if we face a tag structure (see the predicates tag-stack-entry?, tree-stack-entry?, text-contents-entry?, start-tag-stack-entry?, and start-end-tag-stack-entry?).
3 A HTML variant In this section we will explain the differences between the XML parser, as developed until now, and an HTML parser. |
3.1 Important differences |
3.1 Important differences |
<body> Some text. <p> Some more text <p> </body>and
<body> <p> Some text.</p> <p> Some more text </p> </body>is legal. The parsing is not the same, however. The problem is that the first fragment we never meet the end of the two p tags. So a balanced parsing of expression does not work here.
But what works then? We use the following idea:
Thus, we need to re-program most of the parser, including of course the central parse-xml-balanced-expression (which is recursive through read-and-push-subtrees-until-end-tag). There is an accompanying source file, html-support.scm, where these contributions can be found.