Generated: Monday, November 14, 2011, 09:19:58. Copyright © 2011, Kurt Nørmark.

Reference Manual of the HTML parser and pretty printer for LAML

Kurt Nørmark © normark@cs.aau.dk Department of Computer Science, Aalborg University, Denmark.

LAML Source file: tools/xml-html-support/html-support.scm

This is a non-validating HTML parser built on top of the simple XML parser for LAML. The tool also provides HTML pretty printing procedures. The parser is implemented by redefining functions from the XML parser, so most of the XML parser is reused here.

The top-level node is called an html-tree, which may hold top-level comment nodes and declaration nodes (doctype nodes). The parser represents HTML comments within the document as special comment nodes.

The parser will be very confused if it meets a less-than or greater-than character that is not part of a tag. Such characters must be HTML protected (use the special character entities of HTML, such as &lt; and &gt;).
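The protection described above can be sketched in plain Scheme as follows. This is only an illustration: the names protect-char and protect-html-text are hypothetical and not part of LAML, and the ampersand is replaced too, on the assumption that the input text contains no existing entity references.

```scheme
; Hypothetical helpers, not part of LAML.
; Map a single character to its HTML-protected string form.
(define (protect-char ch)
  (cond ((char=? ch #\<) "&lt;")
        ((char=? ch #\>) "&gt;")
        ((char=? ch #\&) "&amp;")   ; assumes no existing entities in the input
        (else (string ch))))

; Protect every character of str and join the pieces.
(define (protect-html-text str)
  (apply string-append (map protect-char (string->list str))))

(protect-html-text "if a < b then c > d")
; => "if a &lt; b then c &gt; d"
```

Text protected in this way can be placed in the body of an element without being mistaken for tag delimiters.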

As of LAML version 31, the parser is able to parse certain non-well-formed HTML documents (documents with crossing tags). This tool assumes that laml.scm and the general library are loaded. The tool loads xml-support (which is the starting point of this HTML support tool) and the collect-skip and file-read libraries.

See the XML support for information about the format of parse trees and variables that control the pretty printing. See also the illustrative examples of the HTML parsing and pretty printing tools.

The typographical rebreaking and re-indenting of running text is still missing.

The LAML interactive tool procedures html-pp and html-parse in laml.scm are convenient top-level pretty printing and parsing procedures, respectively.

Known problem: The handling of spaces after the start tag and before the end tag is not correct.

Please notice that this is not a production quality parser and pretty printer! It is currently used for internal purposes.

Table of Contents:
1. Top level HTML parsing function.
2. HTML pretty printing functions.

Alphabetic index:
parse-html (parse-html file-path) This function parses a file and returns the parse tree.
parse-html-file (parse-html-file in-file-path out-file-path) Parses the file in in-file-path, and delivers the parse tree in out-file-path.
parse-html-string (parse-html-string str) Parses the string str, which is supposed to contain an HTML document.
pretty-print-html-parse-tree (pretty-print-html-parse-tree parse-tree) Pretty prints an HTML parse tree, and returns the result as a string.
pretty-print-html-parse-tree-file (pretty-print-html-parse-tree-file in-file-path [out-file-path]) Pretty prints the HTML parse tree (a Lisp file) in in-file-path.


1 Top level HTML parsing function.

parse-html-file
Form (parse-html-file in-file-path out-file-path)
Description Parses the file in in-file-path, and delivers the parse tree in out-file-path. If in-file-path has an empty file extension, the extension html is added.
See also Scheme source file parse-html-file

parse-html
Form (parse-html file-path)
Description This function parses a file and returns the parse tree. Thus, the difference between this function and parse-html-file is that this function returns the parse tree (no file output). file-path is a file path (relative or absolute). An html extension is added if necessary.
See also Scheme source file parse-html
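
As a usage sketch of the two parsing functions above: the fragment assumes that laml.scm, the general library, and this tool are already loaded. The file names index.html and index.lsp are hypothetical.

```scheme
; Assumes laml.scm, the general library, and html-support.scm are loaded.
; "index.html" and "index.lsp" are hypothetical file names.

; Parse the file and keep the parse tree in memory (no file output).
; The html extension is added because none is given.
(define tree (parse-html "index"))

; Parse the same file, but deliver the parse tree in a file instead.
(parse-html-file "index.html" "index.lsp")
```

The first form suits further processing in the same Scheme session. The second suits a later call of pretty-print-html-parse-tree-file, which reads the parse tree from a file.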


2 HTML pretty printing functions.

pretty-print-html-parse-tree-file
Form (pretty-print-html-parse-tree-file in-file-path [out-file-path])
Description Pretty prints the HTML parse tree (a Lisp file) in in-file-path. Writes the pretty printed result to out-file-path, which defaults to in-file-path if not explicitly passed.
See also Scheme source file pretty-print-html-parse-tree-file

pretty-print-html-parse-tree
Form (pretty-print-html-parse-tree parse-tree)
Description Pretty prints an HTML parse tree, and returns the result as a string.
See also Scheme source file pretty-print-html-parse-tree
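
The two pretty printing functions above might be used as sketched here. The fragment assumes that laml.scm, the general library, and this tool are already loaded. The file names are hypothetical, and index.lsp is assumed to hold a parse tree written earlier by parse-html-file.

```scheme
; Assumes laml.scm, the general library, and html-support.scm are loaded.
; All file names are hypothetical.

; In-memory variant: parse, pretty print to a string, and show the string.
(define tree (parse-html "index.html"))
(define pretty-text (pretty-print-html-parse-tree tree))
(display pretty-text)
(newline)

; File variant: read the parse tree from index.lsp and write the result
; to a separate file. Without the second argument, the result would be
; written to index.lsp itself.
(pretty-print-html-parse-tree-file "index.lsp" "index-pp.html")
```

Passing out-file-path explicitly keeps the parse tree file intact, since the default is to write the result back to in-file-path.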

parse-html-string
Form (parse-html-string str)
Description Parses the string str, which is supposed to contain an HTML document. The parsing is done by writing str to the temp dir in the LAML directory, and then using the function parse-html-file. Precondition: The temp dir of the LAML directory must exist.
See also Scheme source file parse-html-string
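
A usage sketch of parse-html-string, assuming that laml.scm, the general library, and this tool are loaded and that the temp dir of the LAML directory exists. The description above does not state the return value. The sketch assumes it is the parse tree, as with parse-html.

```scheme
; Assumes LAML and html-support.scm are loaded, and that the temp
; directory of the LAML directory exists.
(define document
  "<html><head><title>T</title></head><body><p>Hello</p></body></html>")

; Assumption: the result is the parse tree, as with parse-html.
(define tree (parse-html-string document))

; Pretty print the tree and show the resulting string.
(display (pretty-print-html-parse-tree tree))
(newline)
```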

Generated by LAML SchemeDoc using LAML Version 38.0 (November 14, 2011, full)