The XML parser for LAML

Kurt Nørmark © normark@cs.auc.dk Department of Computer Science, Aalborg University, Denmark

Abstract. This elucidative program documents the development of a simple XML parser for the LAML software package. This is a non-validating XML parser for LAML. As of the current version, the parser is by no means complete. Never the less, it is useful tool for parsing most everyday XML document to a Lisp data structure.
Given a well-formed XML document this parser returns a Lisp tree structure that represents the parse tree of the XML document. This parser skips any element of the XML prolog (front matters) including the DTD (if any). The parser also skips XML comments. The parser handles start tags, end tags, and empty tags (in this parser called start-end tags). Entities and their declarations are not handled at all.
The top level function is xml-parse and xml-parse-file.
See also the brief reference manual of the XML parser.

1  Introduction
In this section we describe the overall parsing ideas together with examples.
1.1  The start
1.2  Handling white space, comments, and front matters.
1.3  The parse tree
1.4  Parse tree functions

1.1 The start
We first make a top level parse function, parse-xml (and parse-xml-file) which we model after the similar function in the dtd parser, which we have made earlier. We decide to read the input from a text file - say f.xml. parse-xml-file. The name of the resulting file is determined by the second parameter of parse-xml-file.

We establish a parse stack, cf. parse-stack. This is plain and ordinary stack to which we make a good interface. Alternatively, we could have used a generic stack ADT.

The function parse-xml-ip starts the real work. After handling empty initial space and front matters (the XML prefix - see section 1.2) the ball is passed to parse-xml-balanced-expression which does the real job. More will follow in section 2.1.

1.2 Handling white space, comments, and front matters.
An XML document typically starts with a prolog, consisting of an XML declaration, document type defintion, and perhaps an inline DTD. This parser does not need to care about all that stuff. Thus, we only want to jump over such material. The function skip-front-matters does it. Notice the skipping of XML declarations () which makes use of the function collect-balanced-until from the collect skip library. The use of this function ensures that we can handle embedded declarations, as long they are balanced.

The skipping of white space using skip-white-space is simple. The function skip-white-space-and-comments also skips XML comments. Notice the recursive implementation of skip-white-space-and-comments which ensures that all white space - including multiple comments - may be skipped.

1.3 The parse tree
The tree structure is as follows:

  T = (tree N ST1 ST2 ... STn)

STi, i = 1..n are trees recursively. tree is a symbol (tagging a syntax tree). A Leaf node is denoted as

  (tree N)

or just N if N is a string (corresponding to textual contents) or an empty tag (a tag without contents, also called a start-end tag in this software).

An inner node of a parse tree corresponds to a tag (an element) with contents. Such a node is represented by the following 'tag structure':

  (tag kind tag-name . attr-info)

Kind is either start, start-end, or end (corresponding to the three kinds of tags). Tag-name is a string. Attr-info is the attribute on property list format. An terminal may be a start-end node, or just a contents string. End tags are not represented in the parse tree.

  (tag start "title" role "xxx" size "5")

The function make-tag-structure constructs the nodes. The function kind-of-tag-structure returns the kind of a node (tag structure); whether start or start-end. There are also functions to access the constituents of a tag structure: tag-of-tag-structure and attributes-of-tag-structure.

If we assume that we parse the following XML fragment (If you are reading this in the source file, please consult the browser version):

  <title role = "xxx" size = "5">
    Here is some contents of title
  </title>

The tree becomes:

  ((start "title" role "xxx" size "5")
     "Here is some contents of title")

A more complicated example

  <title role = "xxx" size = "5">
    Here is some contents of title
    <mark number = "1" listed = "yes"/>
    More text.
    <section number = "1">
      Section text.
    </section>
  </title>

The tree is (slighly formatted, but accurated- actually parsed):

(tree
  (tag start "title" role "xxx" size "5")
  "Here is some contents of title " 
  (tag start-end "mark" number "1" listed "yes")
   "More text. "
  (tree 
    (tag start "section" number "1") 
    "Section text. "))

1.4 Parse tree functions
The function make-parse-tree constructs a parsetree, and the functions root-of-parse-tree and subtrees-of-parse-tree access the root and the subtrees of a parse tree respectively.

The predicate terminal-node? tells whether a tree is considered an terminal node. It is an terminal node if it is a contents string (direct on convoluted in a node), or if it is start-end-node (which cannot have contents). The complementing function function is called inner-node?. Both of these functions assume that they are applied on a tree.

We also have prediates on nodes: start-node? and start-end-node?.

2  The parser
Here we explain the parser as such.
2.1  Overall parsing functions
2.2  Parsing a balanced XML expression.
2.3  Parsing the contents after a start tag
2.4  Building the tree

2.1 Overall parsing functions
The function parse-xml-balanced-expression is supposed to return a (sub)tree, and thus read a balanced expression from the input port. There are three possible results:

A tree rooted of a start node.
A terminal node of start-end node.
A contents string.

The function what-is-ahead looks ahead without really reading. All white space is consumed, so what-is-ahead faces a non-blank character. It is easy to spot if it is a tag, starting with start angle. If not, it is a string (contents).

The function read-tag is more serious. It parses a tag (start tag, end tag, or start-end tag) including the attributes. This function can read all kinds of tags, including an end tag. Internally, it is pretty straightforward. The first case to handle is the end tag (). In the conditional () we handle the following three cases:

An start-end tag without attributes ().
A start tag ().
A start-end tag with attributes ().

The parsing of the attributes in the start or start-end tag is done by the function read-tag-attributes. The rest of this last case has to do with the determination of whether we are facing a start tag or a start-end tag.

The function read-tag-attributes parses a list of attributes. The task of parsing a single attribute, like

  attr = "value"

is done by read-attribute-value-pair. Again, this is just detailed parsing work using the collect-skip library.

2.2 Parsing a balanced XML expression.
As already mentioned in section 2.1 the function parse-xml-balanced-expression is central. We will here examine it in more details.

The function assumes that all initial white space is consumed. We start by finding out if we are facing a tag (start or start-end tag presumably) or textual contents ().

In case of a tag we handle the start, and start-end tag separately. The function read-tag reads the tag.

Let us first notice that the start-end tag is easy to handle, because it is terminal; So we just return the node (the result returned by read-tag).

The most interesting and important case is the start tag (). We push the start tag onto the stack. The function read-and-push-subtrees-until-end-tag pushes additional contributions onto the stack - all read recursively by parse-xml-balanced-expression via read-and-push-subtrees-until-end-tag. Finally the function build-tree-from-stack takes information from the stack (and pops the stack) in order to build and return the parse tree. In the next section we will study read-and-push-subtrees-until-end-tag and build-tree-from-stack.

If we encounter an end tag () there is an error which we report in the final case.

2.3 Parsing the contents after a start tag
The procedure read-and-push-subtrees-until-end-tag must read all the textual contents after the start tag. To be concrete, while parsing

  <title role = "xxx" size = "5">
    Here is some contents of title
    <mark number = "1" listed = "yes"/>
    More text.
    <section number = "1">
      Section text.
    </section>
  </title>

we have just pushed information about the start tag on to the stack. read-and-push-subtrees-until-end-tag must read

"Here is some contents of title"
The mark terminal node.
"More text"
The section subtree

After that it must encounter the title end tag and use this as a stopping condition. This is the most trickiest part of this function.

We establish a look ahead corresponding to the length of end tag, including the "<", ">", and "\". This is the local name n (). If we encounter the end tag string () we just read it - and we are done. If not () we parse a subtree using parse-xml-balanced-expression, push it on to the stack, and iterates tail recursively via read-and-push-subtrees-until-end-tag.

2.4 Building the tree
We are now in a possition where the constituents of the parse tree is on the stack, and we have encountered and read an end tag. Now we shall build the parse tree. This is done by build-tree-from-stack.

Recall that parse-xml-balanced-expression pushed the start tag on to the stack. Consequently, we pop the stack until we meet the tag corresponding to end-tag-name (the parameter of build-tree-from-stack.)

Given our example from section 2.3 the stack is (top on first line):

  The section subtree
  "More text"
  mark start-end-tag
  "Here is some contents of title"
  start title tag

This is done by build-tree-from-stack-1, which is called in a trivial way by build-tree-from-stack. build-tree-from-stack-1 pops the stack iteratively until it meets the start tag (first parameter of build-tree-from-stack-1). The predicate start-tag-entry? identifies a start tag entry, and the function matches-stack-entry finds out if the top tag structure matches the start tag name. Notice that all stack entries are taged, such we can see if we face a tag structure (see the predicates tag-stack-entry?, tree-stack-entry?, text-contents-entry?, start-tag-stack-entry?, and start-end-tag-stack-entry?).

3  A HTML variant
In this section we will explain the differences between the XML parser, as developed until now, and an HTML parser.
3.1  Important differences

3.1 Important differences
HTML is not as clean and easy to deal with as XML. The main stumbling block is tags which can be applied as empty tags (without end tags) as well as with an end tag. As an example both

  <body>
    Some text.
    <p>
    Some more text
    <p>
  </body>

and

  <body>
    <p> Some text.</p>
    <p> Some more text </p>
  </body>

is legal. The parsing is not the same, however. The problem is that the first fragment we never meet the end of the two p tags. So a balanced parsing of expression does not work here.

But what works then? We use the following idea:

Read and stack tags and contents strings until we encounter some end tag.
When we encounter an end tag we pop the stack until we meet the closest similar start tag.
During our way down the stack we may pass some start tags, which we now realise never will be terminated by end tags. These must be converted to start-end tags (empty tags)
We form a tree of all the elements until and including the matcing start tag.
We push the subtree so formed.

Is this a nice recursive algorithm? NO. It is an iterative algorithm. We run until the stack is empty. It becomes empty when we meet the end html tag (or a similar enclosing tag surrounding the whole document).

Thus, we need to re-program most of the parser, including of course the central parse-xml-balanced-expression (which is recursive through read-and-push-subtrees-until-end-tag). There is an accompanying source file, html-support.scm, where these contributions can be found.