Generated: Monday, November 14, 2011, 09:19:45. Copyright © 2011, Kurt Nørmark

Reference Manual of the XML Document Type Definition Parser

Kurt Nørmark © normark@cs.aau.dk Department of Computer Science, Aalborg University, Denmark.

LAML Source file: tools/dtd-parser/dtd-parser-4.scm

This tool parses an XML Document Type Definition (DTD). The parsed result is represented as a flat list of element, attribute, and entity descriptors.

This version of the XML DTD parser also parses the content models of the XML elements. This is the basis for the fully automatic synthesis of finite state automata for validation of XML-in-LAML documents at document generation time.

Another tool, the XML-in-LAML mirror generation tool, is able to produce a Scheme mirror of the XML language. The mirror generation tool takes as input the data structures produced by the DTD parser.

There is some internal elucidative documentation of the DTD parser at the LAML development site.

Please consult "XML mirrors in Scheme: XML in LAML" (section 2) for a tutorial introduction to the use of the parser.

The DTD parser is not perfect, but we improve it as the need arises. See the README file in the directory of the source files for details and progress.

An earlier (and now obsolete) version of this tool was used to parse the HTML4.01 DTD (which is a non-XML DTD).

Table of Contents:
1. Introduction and usage.
2. The main parsing function.
3. Variables which control the parser.
4. Variables which are assigned to parser output.

Alphabetic index:
attribute-list: A list of attributes defined by the parser.
dtd-parse-verbose: A boolean variable which controls whether the parser should report on the progress while parsing.
element-list: A list of elements defined by the parser.
entity-list: A list of entities defined by the parser.
forced-inclusion-of-marked-sections: A boolean variable which controls whether to parse ignored marked sections anyway.
notation-list: A list of notations defined by the parser.
parse-dtd: (parse-dtd file . non-expanding-entities) Parses a dtd file to Lisp expressions and writes the parsed result to another file.
parse-rhs-string: (parse-rhs-string rhs-str) Top level rhs parser function.


1 Introduction and usage.
The easiest way to parse an XML DTD is to use the procedure xml-dtd-parse, located in laml.scm in the root of the LAML distribution. This procedure can be called from a LAML prompt.

Here is another, slightly lower-level way to use the parser:

  1. Start a Scheme interpreter, such as SCM or MzScheme.
  2. Define the variable laml-dir to the LAML directory path on your computer.
  3. Load this file (dtd-parser-4.scm in tools/dtd-parser) into the Scheme interpreter, for instance from the directory in which the source file resides.
  4. Call the function (parse-dtd "dtd-file-without-extension"). Additional parameters of parse-dtd are interpreted as non-expanding entities. The call produces a file with the same name and the extension lsp, which holds the parsed DTD as Lisp expressions. In addition, the variables element-list, attribute-list, and entity-list will be defined by the parser. The resulting file contains a "flat" concatenation of these three lists.

Use of TABs in the DTD source file may cause lexical misinterpretations. It is therefore recommended to convert TABs to spaces before parsing the DTD.
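
The TAB conversion recommended above can be sketched in plain Scheme. The helper below is only an illustration and is not part of the LAML distribution; the name tabs->spaces is an assumption. It works on a string, so applying it to a whole DTD file (reading the file and writing a converted copy) is left to the reader.

```scheme
;; Illustrative helper, not part of the LAML DTD parser.
;; The TAB character is built from its character code (9),
;; because R5RS does not guarantee a #\tab literal.
(define tab-char (integer->char 9))

;; Return a copy of STR in which every TAB is replaced by one space.
(define (tabs->spaces str)
  (list->string
    (map (lambda (ch) (if (char=? ch tab-char) #\space ch))
         (string->list str))))

;; Example: a DTD declaration that uses a TAB as separator.
(tabs->spaces (string-append "<!ELEMENT" (string tab-char) "note (to, body)>"))
```

Replacing each TAB by exactly one space keeps the length of every line unchanged, so positions mentioned in any error reports still match the original file.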

We now describe the format of an entity, an attribute, and an element in the parsed result. Each unit in the result of the parsing is either an element, an attribute, or an entity. An element describes, in popular terms, "a single tag" in the markup language. An attribute unit describes the attributes of an element. Thus, in general, there is a one-to-one correspondence between the element list and the attribute list. The entities are macro expansions, which have been applied to achieve a clean and parsable DTD file. The entity list is of minor importance once the file has been parsed.

The first element of each top-level form in the parsed result is a tag that distinguishes the kind of form: element, attribute, or entity.

  • Element format:
    (element name start-tag-status end-tag-status content-model comment)

The start-tag-status and end-tag-status are leftovers from SGML and HTML4; see the HTML4.01 definition. These two status fields are not used for XML. The content model is a symbol (any or empty) or a list prefixed with either mixed-content or element-content.

Mixed content lists have one of the forms

  (mixed-content pcdata)
  (mixed-content (choice zero-or-more pcdata NAME-STRING NAME-STRING ... NAME-STRING))

Element content lists have the form

  (element-content CONT)

where

CONT ::= ( KIND MULTIPLICITY DATA+ )
KIND ::= name | seq | choice | empty
MULTIPLICITY ::= one | optional | zero-or-more | one-or-more
DATA ::= NAME-STRING | CONT

and where NAME-STRING is a placeholder for a string constant.
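
As a minimal sketch of how a CONT form can be processed, the following function walks the grammar above and collects every NAME-STRING it meets, from left to right. The function name content-names and the sample content model are illustrative assumptions; they are not taken from the parser source.

```scheme
;; Collect the element names mentioned in a CONT form.
;; A CONT is (KIND MULTIPLICITY DATA+), and each DATA item is
;; either a name string or a nested CONT.
(define (content-names cont)
  (cond ((string? cont) (list cont))          ; a NAME-STRING
        ((pair? cont)
         ;; Skip KIND and MULTIPLICITY, then recurse into each DATA item.
         (apply append (map content-names (cddr cont))))
        (else '())))

;; Hypothetical content model for (head, (p | div?)*).
(define sample-cont
  '(seq one
     (name one "head")
     (choice zero-or-more "p" (name optional "div"))))

(content-names sample-cont)
```

The walk ignores KIND and MULTIPLICITY on purpose. A validator would instead dispatch on them, for instance by treating seq as concatenation and choice as alternation.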

Here is information about the parsed formats of attributes and entities:

  • Attribute format:
    (attribute name list-of-attribute-triples).
    An attribute triple consists of the attribute name, the attribute type (the most basic type, with no entities), and an "availability status".

  • Entity format:
    (entity name entity-expansion).
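
The tagged formats above make it straightforward to pick forms out of the flat parsed result. The sketch below assumes that the tags are symbols, that names are strings, and that the name in an attribute form is the name of the element it belongs to. The sample data, including the status strings and attribute types, is hypothetical and not copied from actual parser output. The function names forms-with-tag and attribute-triples are likewise illustrative.

```scheme
;; Hypothetical parsed result: a flat list of tagged forms.
;; Field values such as "-" and "#REQUIRED" are illustrative guesses.
(define sample-result
  '((element "note" "-" "-" any "")
    (attribute "note" (("id" "ID" "#REQUIRED") ("lang" "CDATA" "#IMPLIED")))
    (entity "copy" "(c)")))

;; Return all forms whose first element is the symbol TAG.
(define (forms-with-tag tag forms)
  (cond ((null? forms) '())
        ((eq? (car (car forms)) tag)
         (cons (car forms) (forms-with-tag tag (cdr forms))))
        (else (forms-with-tag tag (cdr forms)))))

;; Return the attribute triples recorded for ELEMENT-NAME,
;; or the empty list if no attribute form carries that name.
(define (attribute-triples element-name forms)
  (let loop ((candidates (forms-with-tag 'attribute forms)))
    (cond ((null? candidates) '())
          ((equal? (cadr (car candidates)) element-name)
           (caddr (car candidates)))
          (else (loop (cdr candidates))))))

(map cadr (forms-with-tag 'element sample-result))  ; names of all elements
(attribute-triples "note" sample-result)            ; triples for "note"
```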


2 The main parsing function.
There is only one function, parse-dtd, to learn. This function represents and "is" the tool.

parse-dtd
Form (parse-dtd file . non-expanding-entities)
Description Parses a DTD file to Lisp expressions and writes the parsed result to another file. The parameter file is given without extension. The remaining parameters form a list of entity names (strings without leading percent chars); no entity in the list will be expanded. This function also defines the lists element-list, attribute-list, and entity-list. The parsed result is the concatenation of element-list, attribute-list, and entity-list. It is assumed that the DTD resides in file.dtd. This function writes the result to file.lsp.


3 Variables which control the parser.

dtd-parse-verbose
Form dtd-parse-verbose
Description A boolean variable which controls whether the parser should report on the progress while parsing. Default #t.

forced-inclusion-of-marked-sections
Form forced-inclusion-of-marked-sections
Description A boolean variable which controls whether to parse ignored marked sections anyway. Default #t.


4 Variables which are assigned to parser output.

entity-list
Form entity-list
Description A list of entities defined by the parser

element-list
Form element-list
Description A list of elements defined by the parser

attribute-list
Form attribute-list
Description A list of attributes defined by the parser

notation-list
Form notation-list
Description A list of notations defined by the parser

parse-rhs-string
Form (parse-rhs-string rhs-str)
Description Top level rhs parser function.

Generated by LAML SchemeDoc using LAML Version 38.0 (November 14, 2011, full)