;;; @Package ASXT ;;; @Subtitle Another Scheme XML Transformer ;;; @HomePage http://www.neilvandyke.org/asxt/ ;;; @Author Neil W. Van Dyke ;;; @AuthorEmail neil@@neilvandyke.org ;;; @Version 0.1 ;;; @Date 31 December 2004 ;; $Id: asxt.scm,v 1.188 2004/12/31 21:01:41 neil Exp $ ;;; @legal ;;; Copyright @copyright{} 2004 Neil W. Van Dyke. This program is Free ;;; Software; you can redistribute it and/or modify it under the terms of the ;;; GNU Lesser General Public License as published by the Free Software ;;; Foundation; either version 2.1 of the License, or (at your option) any ;;; later version. This program is distributed in the hope that it will be ;;; useful, but without any warranty; without even the implied warranty of ;;; merchantability or fitness for a particular purpose. See the GNU Lesser ;;; General Public License [LGPL] for details. For other license options and ;;; commercial consulting, contact the author. ;;; @end legal ;;; @section Introduction ;;; ;;; ASXT is another XML transformation library for Scheme.@footnote{``ASXT'' ;;; may be pronounced phonetically, in which case the following mnemonic might ;;; be helpful:@* ``Thrice have I ASXT thee to transform!''} It works with ;;; [SXML], including that with extraneous list nesting and nulls, and with ;;; [HtmlPrag]'s SHTML variant of SXML. ;;; ;;; ASXT is conceptually very simple, and may at some future time be an ;;; underlying mechanism in a higher-level language. At this time, ASXT is ;;; being used as a practical tool while we attempt to identify what ;;; higher-level features and syntax would be most useful for our applications. ;;; ;;; Presently, ASXT is somewhat similar to Oleg Kiselyov's [Pre-post-order] and ;;; Kirill Lisovsky's [STX], and indeed has been informed and inspired by ;;; those. ASXT might be described as @code{pre-post-order} with inheritance ;;; and only preorder traversal, and as @code{stx-engine} without [SXPath]. ;;; ASXT's simplifications were made experimentally, as a starting point for ;;; building higher-level features that might be better-suited to some ;;; applications. We also expect to incorporate ideas from Jim Bender's ;;; [WebIt]. ;;; ;;; The ASXT implementation includes a simple ``compiler'' that produces a ;;; closure from input ASXT language. This closure, which may be applied to ;;; SXML nodes, should do a fairly efficient job of recursively dispatching on ;;; SXML node types to closures, and of assembling the result tree. In most ;;; cases, null lists and extraneous list nestings produced by user-supplied ;;; closures are discarded quickly and are not included in the output SXML. ;;; ;;; The ASXT library requires only R5RS and [SRFI-23]. ;;; @section Language ;;; ;;; The roughly EBNF grammar (in which parentheses represent sexp lists) for ;;; the ASXT language is: ;;; ;;; @example ;;; ::= | ;;; ;;; ::= | *error* | *null* | *same* ;;; ;;; ::= ( [ *inherit* ] ;;; [ ( *text* ) ] ;;; [ ( *default* ) ] ;;; @{ ( ) @}* ) ;;; ;;; ::= | ( @{ @}+ ) ;;; @end example ;;; ;;; The starting nonterminal is @i{}. An action is something to which ;;; an SXML node is applied, and the ASXT starting action is applied to the ;;; top-level node of the SXML tree. @i{} can be a @i{} ;;; or @i{}, but is virtually always the latter for top-level nodes. ;;; ;;; @i{} can be a closure, which accepts an SXML node, or one of ;;; three symbols: @code{*error*} causes an error to be signaled, indicating ;;; that the node is syntactically invalid; @code{*null*} yields the null list ;;; and performs no further processing on the node or its children; and ;;; @code{*same*} yields the node with no further processing. ;;; ;;; @i{} is a list that describes a mapping from SXML elements to ;;; actions. When bindings are nested, with a child set of bindings as an ;;; action within a parent set of bindings, then the child bindings are applied ;;; in that action to the SXML child nodes of the SXML parent element node that ;;; triggered the action. This recursive tree traversal is guaranteed to be ;;; performed preorder (depth-first), and side-effects are permitted. ;;; ;;; Within @i{}, @code{*text*} matches SXML text nodes, and ;;; @i{} is a symbol that matches an SXML element of that name. ;;; @code{*default*} matches any SXML element that is not matched by any given ;;; @i{}. @code{*default*} does not match SXML text nodes. ;;; ;;; If the first element of a child @i{} is @code{*inherit*}, then ;;; individual bindings of the immediate parent @i{} (including ;;; @code{*default*} and any bindings inherited by the parent) are inherited if ;;; not overridden in the child @i{}. ;;; ;;; If a @i{} does not have @code{*inherit*}, then the following ;;; bindings are effectively inherited if not overridden: ;;; ;;; @lisp ;;; ((*text* *error*) ;;; (*default* *error*) ;;; (*COMMENT* *null*)) ;;; @end lisp ;;; ;;; ASXT is normally executed using the @code{asxt} procedure, which accepts a ;;; compiled uncompiled ASXT language and an input SXML. @code{asxt} may be ;;; used within @i{} action, such as to interpose additional behavior at ;;; a step in an overall ASXT traversal. Note that, since the @i{} ;;; closure will be applied to the node that was matched for the action, use ;;; the @code{cdr} of that node to descend to child nodes in the traversal, as ;;; is shown by the following two functionally equivalent procedures: ;;; ;;; @lisp ;;; (asxt '((a ((b ((*text* *same*) ;;; (c ((d *same*)))))))) ;;; sxml) ;;; ;;; (asxt `((a ((b ,(lambda (node) ;;; (asxt '((*text* *same*) ;;; (c ((d *same*)))) ;;; (cdr node))))))) ;;; sxml) ;;; @end lisp ;;; ;;; @noindent ;;; For @code{sxml} input @code{(a (b (c (d "1"))) (b "2")))}, both expressions ;;; yield @code{((d "1") "2")}. ;;; ;;; As an example of using ASXT to extract the links from an HTML input (more ;;; precisely, the values of the @code{href} attributes of @code{a} elements), ;;; using @code{html->shtml} from [HtmlPrag]: ;;; ;;; @lisp ;;; (asxt '((*text* *null*) ;;; (*default* (*inherit*)) ;;; (a (*inherit* ;;; (@@ ((*default* *null*) ;;; (href ((*text* *same*)))))))) ;;; (html->shtml ;;; "My Title ;;;

This isn't Yahoo.

;;;

au.org is short.

;;; ")) ;;; @result{} ("http://www.yahoo.com/" "http://www.au.org/") ;;; @end lisp ;;; ;;; @noindent ;;; Naturally, this particular extraction task can be solved much more easily ;;; using XPath. The purpose of the example is to illustrate ASXT using a ;;; familiar problem. Also, in real-world ASXT programs, good practice would ;;; be to yield a list of SXML elements rather than a list of strings, as ;;; @code{asxt} might conceivably concatenate the strings as if they were SXML. ;;; @section API ;;; ;;; The ASXT library provides a few Scheme variables and syntax extensions. ;;; @defvar asxt-comment-symbol ;;; @defvarx asxt-entity-symbol ;;; @defvarx asxt-pi-symbol ;;; @defvarx asxt-top-symbol ;;; ;;; As a convenience for Scheme code that must be portable to implementations ;;; with case-sensitivity issues, these variables are bound to the following ;;; upper-case symbols used by SXML, respectively: @code{*COMMENT*}, ;;; @code{*ENTITY*}, @code{*PI*}, @code{*TOP*} (define asxt-comment-symbol (string->symbol "*COMMENT*")) (define asxt-entity-symbol (string->symbol "*ENTITY*")) (define asxt-pi-symbol (string->symbol "*PI*")) (define asxt-top-symbol (string->symbol "*TOP*")) ;;; @defproc asxt action sxml ;;; ;;; Applies ASXT @var{action} to each of the top level nodes of input ;;; @var{sxml} in turn, and yields the collected result. (define (asxt action sxml) ;; TODO: Maybe permit "sxml" to be an input-port. (asxt-internal:result-tidy ((if (procedure? action) action (compiled-asxt action)) ;; TODO: Here we put "sxml" inside a dummy element, so that the bindings ;; handler is always called with an element list. If all input "sxml" had ;; a valid "*TOP*" element, this would not be necessary. This dummy ;; element also permits us to check for a valid "*TOP*" element. But maybe ;; we should rework things so that we don't need this. This "result-tidy" ;; is not the best idea in any case. (list '() sxml)))) (define (asxt-internal:result-tidy result) (reverse (asxt-internal:result-prepend result '()))) ;;; @defproc compiled-asxt action ;;; ;;; Yields a closure that is the result of compiling ASXT @var{action}. Note ;;; that the @var{asxt} procedure will automatically compile its ASXT if ;;; necessary, so @code{compiled-asxt} is mainly useful for avoiding redundant ;;; compilations when @code{asxt} is called many times with the same ASXT. For ;;; example: ;;; ;;; @lisp ;;; (compiled-asxt '((message ((param ((*default* *same*))))))) ;;; @result{} # ;;; @end lisp (define asxt-internal:no-inherit '*no-inherit*) (define (compiled-asxt action) (asxt-internal:compile action asxt-internal:no-inherit asxt-internal:no-inherit asxt-internal:no-inherit)) (define (asxt-internal:error-proc node) (error "Invalid SXML element in this context:" node)) (define (asxt-internal:null-proc node) '()) (define (asxt-internal:same-proc node) node) (define (asxt-internal:compile action inherited-text-action inherited-default-pair inherited-elem-map) (cond ((procedure? action) action) ((null? action) (error "Null ASXT bindings list")) ((symbol? action) (case action ((*error*) asxt-internal:error-proc) ((*null*) asxt-internal:null-proc) ((*same*) asxt-internal:same-proc) (else (error "Invalid ASXT action symbol:" action)))) ((list? action) (if (eq? (car action) '*inherit*) (if (eq? inherited-text-action asxt-internal:no-inherit) (error "Top-level ASXT bindings cannot inherit:" action) (asxt-internal:compile2 (cdr action) inherited-text-action inherited-default-pair inherited-elem-map)) (asxt-internal:compile2 action asxt-internal:error-proc (cons '*default* asxt-internal:error-proc) (list (cons asxt-comment-symbol asxt-internal:null-proc))))) (else (error "Invalid ASXT bindings:" action)))) ;; TODO: Maybe turn "text-action" into "text-pair", like "default-action" was. ;; It doesn't need to be, since we don't need a shared pair for patching up ;; inherited bindings with child bindings, since text nodes can't have child ;; bindings. ;; TODO: Now that we are constructing the child binding closures differently, ;; we can go back to our original almost-pure functional (everything but the ;; final patchup "set-cdr!"s) implementation of the compiler. (define asxt-internal:compile2 (lambda (bindings inherited-text-action inherited-default-pair inherited-elem-map) (let ((text-action #f) (default-pair #f) (elem-map '()) (error-elems '()) (child-bindings-map '())) ;; Process the binding lines. (for-each (lambda (binding) (if (not (and (pair? binding) (pair? (cdr binding)) (null? (cddr binding)))) (error "ASXT binding must be a list of length 2:" binding)) (let* ((names (car binding)) (val (cadr binding)) (action-pp (let ((ph (string->symbol "#"))) (lambda (val) (cond ((eqv? asxt-internal:null-proc val) '*null*) ((eqv? asxt-internal:same-proc val) '*same*) ((asxt-internal:patcher? val) ph) (else val))))) (bind (letrec-syntax ((already-bound (syntax-rules () ((_ ?name ?old-val) (error "Name bound to value before ASXT binding:" ?name (action-pp ?old-val) binding)))) (check-not-bound (syntax-rules () ((_ ?name) (let ((name ?name)) (cond ((memq name error-elems) (already-bound name '*error*)) ((assq name elem-map) => (lambda (val) (already-bound name val)))))))) (set-action (syntax-rules () ((_ ?name ?var ?val) (if ?var (already-bound ?name ?var) (set! ?var ?val))))) (add-elem (syntax-rules () ((_ ?name ?val) (let ((name ?name) (val ?val)) (check-not-bound name) (let ((pair (cons name val))) (set! elem-map (cons pair elem-map)) pair)))))) (cond ((procedure? val) (lambda (name) (case name ((*text*) (set-action name text-action val)) ((*default*) (set-action name default-pair (cons name val))) (else (add-elem name val))))) ((eq? val '*error*) (lambda (name) (case name ((*text*) (set-action name text-action asxt-internal:error-proc)) ((*default*) (set-action name default-pair (cons name asxt-internal:error-proc))) (else (check-not-bound name) (set! error-elems (cons name error-elems)))))) ((symbol? val) (let ((proc (case val ((*null*) asxt-internal:null-proc) ((*same*) asxt-internal:same-proc) (else (error "Invalid ASXT action symbol in binding:" binding))))) (lambda (name) (case name ((*text*) (set-action name text-action proc)) ((*default*) (set-action name default-pair (cons name proc))) (else (add-elem name proc)))))) ((list? val) (let ((patcher (cond ((assq val child-bindings-map) => cdr) (else (let ((patcher (asxt-internal:make-patcher))) (set! child-bindings-map (cons (cons val patcher) child-bindings-map)) patcher))))) (lambda (name) (if (eq? name '*text*) (error "ASXT *text* cannot have child bindings:" binding) ;; TODO: Clean this up, maybe by making ;; "set-action" return a pair (after ;; changing "text-action" to "text-pair"), ;; or by making add-elem accept a pair. (asxt-internal:patcher:add-pair! patcher (if (eq? name '*default*) (let ((pair (cons name patcher))) (set-action name default-pair pair) pair) (add-elem name patcher))))))) (else (error "Invalid ASXT action in binding:" binding)))))) (if (symbol? names) (bind names) (for-each bind names)))) bindings) ;; Inherit "text-action" and "default-pair". (set! text-action (or text-action inherited-text-action)) (set! default-pair (or default-pair inherited-default-pair)) ;; Inherit element bindings. (for-each (lambda (pair) (let ((name (car pair))) (or (assq name elem-map) (memq name error-elems) (set! elem-map (cons (cons name (cdr pair)) elem-map))))) inherited-elem-map) ;; Process "child-bindings-map". (for-each (lambda (n) (asxt-internal:patcher:doit (cdr n) (asxt-internal:compile (car n) text-action default-pair elem-map))) child-bindings-map) ;; Make procedure. (Note: We could make this procedure much earlier, ;; which we'd want to do if we want to go back to detecting cycles in ;; the ASXT input lists.) (asxt-internal:make-bindings-proc text-action (asxt-internal:make-elem-proc default-pair (if (or (null? error-elems) (eq? (cdr default-pair) asxt-internal:error-proc)) elem-map (append elem-map (map (lambda (name) (cons name asxt-internal:error-proc)) error-elems)))))))) (define (asxt-internal:make-elem-proc default-pair elem-map) ;; (pretty-print `("*DEBUG* make-elem-proc" ;; ("default-pair" ,default-pair) ;; ("elem-map" ,elem-map))) (lambda (node) ;; (pretty-print `("*DEBUG* HANDLING ELEMENT" ;; ("(car node)" ,(car node)) ;; ("default-pair" ,default-pair) ;; ("elem-map" ,elem-map))) ((cond ((assq (car node) elem-map) => cdr) (else (cdr default-pair))) node))) (define (asxt-internal:make-bindings-proc text-proc elem-proc) (lambda (node) (reverse (let loop ((result '()) (children (cdr node))) (if (null? children) result (loop (let ((child (car children))) (cond ((string? child) (cons (text-proc child) result)) ((null? child) result) ((list? child) (let ((child-name (car child))) (if (symbol? child-name) (if (eq? child-name '*text*) (error "Invalid SXML element name \"*text*\":" child) (asxt-internal:result-prepend (elem-proc child) result)) (loop result child)))) (else (error "Invalid object in SXML list:" child)))) (cdr children))))))) (define (asxt-internal:result-prepend to-prepend result) (cond ((null? to-prepend) result) ((and (list? to-prepend) (not (symbol? (car to-prepend)))) (let loop ((result result) (to-prepend to-prepend)) (if (null? to-prepend) result (loop (asxt-internal:result-prepend (car to-prepend) result) (cdr to-prepend))))) (else (cons to-prepend result)))) (define (asxt-internal:make-patcher) (vector '())) (define asxt-internal:patcher? vector?) (define (asxt-internal:patcher:add-pair! patcher pair) (vector-set! patcher 0 (cons pair (vector-ref patcher 0)))) (define (asxt-internal:patcher:doit patcher val) (for-each (lambda (pair) (set-cdr! pair val)) (vector-ref patcher 0))) ;;; @defproc top-added-asxt bindings ;;; ;;; Yields ASXT bindings that are @var{bindings} with additional bindings added ;;; to disregard any SXML @code{*TOP*} element that is present. For example: ;;; ;;; @lisp ;;; (top-added-asxt '((html (body @dots{})))) ;;; @result{} ;;; ((html (body @dots{})) ;;; ((@@ *PI*) *null*) ;;; (*TOP* (*inherit* ;;; (*TOP* *error*)))) ;;; @end lisp (define (top-added-asxt bindings) ;; TODO: Maybe make this not conflict with any existing bindings for "@" and ;; "*PI*". (append bindings (list (list (list '@ asxt-pi-symbol) '*null*) (list asxt-top-symbol (list '*inherit* (list asxt-top-symbol '*error*)))))) ;;; @defsyntax asxt-setter variable action ;;; ;;; Compiles @var{action} and defines an ASXT action closure that applies the ;;; compiled @var{action} and both binds @var{variable} to the result and ;;; yields the result. (define-syntax asxt-setter ;; TODO: Maybe add error-check for already set, although probably what we ;; really want is something schema-based anyway. (syntax-rules () ((_ ?variable ?action) (let ((proc (compiled-asxt ?action))) (lambda (node) (let ((val (proc node))) (set! ?variable val) val)))))) ;;; @defproc asxt-attr-value attr-node ;;; ;;; Yields the string value of SXML element attribute @var{attr-node}. This is ;;; used internally by @code{asxt-attr-setter}, but can also be used directly: ;;; ;;; @lisp ;;; (asxt-attr-value '(href ("http://") () (("foo.foo/")))) ;;; @result{} "http://foo.foo/" ;;; @end lisp ;;; ;;; @noindent ;;; Note that this procedure flattens lists and concatenates strings in the ;;; value, although in most cases element attribute values will be represented ;;; in SXML as single strings. (define asxt-attr-value (let ((text-same-bindings (compiled-asxt '((*text* *same*))))) (lambda (attr-node) (apply string-append (text-same-bindings attr-node))))) ;;; @defsyntax asxt-attr-setter variable ;;; ;;; Expands to: @code{(asxt-setter @var{variable} asxt-attr-value)} ;;; ;;; For an illustrative example, a program that accepts HTML in SXML form and ;;; yields a list of the URLs of @code{img} elements along with any width and ;;; height information specified for each in the HTML: ;;; ;;; @smalllisp ;;; (define scraped-imgs ;;; (compiled-asxt ;;; `((*text* *null*) ;;; (*default* (*inherit*)) ;;; (@@ *null*) ;;; (img ,(lambda (node) ;;; (let ((s #f) (w #f) (h #f)) ;;; (asxt `((@@ ((*default* *null*) ;;; (src ,(asxt-attr-setter s)) ;;; (width ,(asxt-attr-setter w)) ;;; (height ,(asxt-attr-setter h))))) ;;; (cdr node)) ;;; (vector s w h))))))) ;;; ;;; (asxt scraped-imgs ;;; '(html (body (img (@@ (height "60") (src "b.jpg") (width "80"))) ;;; (p "Stage Right: " ;;; (img (@@ (src "a.jpg") (align "right")))))) ;;; @result{} (#("b.jpg" "80" "60") #("a.jpg" #f #f)) ;;; @end smalllisp (define-syntax asxt-attr-setter (syntax-rules () ((_ ?variable) ;; TODO: Maybe add error-check for already set. (asxt-setter ?variable asxt-attr-value)))) ;;; @unnumberedsec References ;;; ;;; @table @asis ;;; ;;; @item [HtmlPrag] ;;; Neil W. Van Dyke, ``HtmlPrag: Pragmatic Parsing of HTML to SHTML and ;;; SXML.''@* ;;; @uref{http://www.neilvandyke.org/htmlprag/} ;;; ;;; @item [LGPL] ;;; Free Software Foundation, ``GNU Lesser General Public License,'' Version ;;; 2.1, February 1999, 59 Temple Place, Suite 330, Boston, MA 02111-1307 ;;; USA.@* ;;; @uref{http://www.gnu.org/copyleft/lesser.html} ;;; ;;; @item [Pre-post-order] ;;; Oleg Kiselyov, ``XML/HTML processing in Scheme: SXML expression tree ;;; transformers,''@* ;;; @uref{http://pobox.com/~oleg/ftp/Scheme/lib/SXML-tree-trans.scm} ;;; ;;; @item [SRFI-23] ;;; Stephan Houben, ``Error reporting mechanism,'' SRFI 23, 26 April 2001.@* ;;; @uref{http://srfi.schemers.org/srfi-23/srfi-23.html} ;;; ;;; @item [SSAX] ;;; Oleg Kiselyov, ``A functional-style framework to parse XML documents,'' ;;; 5 September 2002.@* ;;; @uref{http://pobox.com/~oleg/ftp/Scheme/xml.html#XML-parser} ;;; ;;; @item [STX] ;;; Kirill Lisovsky, ``STX,''@* ;;; @uref{http://www.pair.com/lisovsky/transform/stx/} ;;; ;;; @item [SXML] ;;; Oleg Kiselyov, ``SXML,'' revision 3.0.@* ;;; @uref{http://pobox.com/~oleg/ftp/Scheme/SXML.html} ;;; ;;; @item [SXPath] ;;; Kirill Lisovsky, ``SXPath - SXML Query Language,''@* ;;; @uref{http://pair.com/lisovsky/query/sxpath/} ;;; ;;; @item [WebIt] ;;; Jim Bender, ``WebIt! - An XML Framework for Scheme.''@* ;;; @uref{http://celtic.benderweb.net/webit/} ;;; ;;; @end table