WebScraperHelper: Simple Generation of SXPath Queries from SXML Examples

WebScraperHelper is intended as a programmer's aid for crafting SXPath queries to extract information (e.g., news items, prices) from HTML Web pages that have been parsed by HtmlPrag. The current version of WebScraperHelper accepts an example SXML (or SHTML) document and an example “goal” subtree of that document, and yields up to three different SXPath queries that select the goal. A generated query can often be incorporated into a Web-scraping program as-is, for extracting information from documents with very similar formatting. Generated queries can also be used as starting points for hand-crafted queries.

For example, given the SXML document doc:

(define doc
  '(*TOP* (html (head (title "My Title"))
                (body (@ (bgcolor "white"))
                      (p "Summary: This is a document.")
                      (div (@ (id "ResultsSection"))
                           (h2 "Results")
                           (p "These are the results.")
                           (table (@ (id "ResultTable"))
                                  (tr (td (b "Input:"))
                                      (td "2 + 2"))
                                  (tr (td (b "Output:"))
                                      (td "Four")))
                           (p "Lookin' good!"))))))

evaluating the expression

(webscraperhelper '(td "Four") doc)

will display generated queries like:

Absolute SXPath:           (html body div table (tr 2) (td 2))
Absolute SXPath with IDs:  (html body
                            (div (@ (equal? (id "ResultsSection"))))
                            (table (@ (equal? (id "ResultTable"))))
                            (tr 2) (td 2))
Relative SXPath with IDs:  (// (table (@ (equal? (id "ResultTable"))))
                            (tr 2) (td 2))
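
To make the first of these concrete, here is a minimal sketch of how an absolute query could be derived from an example document and a goal subtree. It is not WebScraperHelper's actual implementation. The procedure names (sxml-element?, child-elements, count-named, find-absolute-path) are hypothetical, and the sketch assumes that any list headed by a symbol other than @ is an element, so SXML comment and processing-instruction nodes are not handled specially. It uses only R5RS procedures.

```scheme
;; Hypothetical sketch; NOT the WebScraperHelper implementation.

(define doc
  '(*TOP* (html (head (title "My Title"))
                (body (@ (bgcolor "white"))
                      (p "Summary: This is a document.")
                      (div (@ (id "ResultsSection"))
                           (h2 "Results")
                           (p "These are the results.")
                           (table (@ (id "ResultTable"))
                                  (tr (td (b "Input:"))
                                      (td "2 + 2"))
                                  (tr (td (b "Output:"))
                                      (td "Four")))
                           (p "Lookin' good!"))))))

;; Assumption: any list headed by a symbol other than @ is an element.
(define (sxml-element? x)
  (and (pair? x) (symbol? (car x)) (not (eq? (car x) '@))))

;; Child elements of NODE in document order (text and @ lists skipped).
(define (child-elements node)
  (let loop ((kids (cdr node)) (acc '()))
    (cond ((null? kids) (reverse acc))
          ((sxml-element? (car kids))
           (loop (cdr kids) (cons (car kids) acc)))
          (else (loop (cdr kids) acc)))))

;; Number of elements in ELEMS whose tag is NAME.
(define (count-named name elems)
  (let loop ((rest elems) (n 0))
    (cond ((null? rest) n)
          ((eq? (car (car rest)) name) (loop (cdr rest) (+ n 1)))
          (else (loop (cdr rest) n)))))

;; Steps from NODE down to the first subtree equal? to GOAL, or #f.
;; A step is a bare tag when it is the only child with that tag,
;; otherwise (tag n), where n is its 1-based position among them.
(define (find-absolute-path goal node)
  (let ((kids (child-elements node)))
    (let loop ((rest kids) (seen '()))
      (if (null? rest)
          #f
          (let* ((kid (car rest))
                 (name (car kid))
                 (tail (if (equal? kid goal)
                           '()
                           (find-absolute-path goal kid))))
            (if tail
                (cons (if (= 1 (count-named name kids))
                          name
                          (list name (+ 1 (count-named name seen))))
                      tail)
                (loop (cdr rest) (cons kid seen))))))))

(find-absolute-path '(td "Four") doc)
;; ⇒ (html body div table (tr 2) (td 2))
```

For the example document this reproduces the absolute query shown above. The ID-based variants would additionally need to inspect each element's @ attribute list.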

The queries can then be compiled with the sxpath procedure of the SXPath library:

(define query
  (sxpath '(// (table (@ (equal? (id "ResultTable"))))
               (tr 2) (td 2))))

(query doc) ⇒ ((td "Four"))
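
In a scraping program, the matched node usually has to be reduced to its text, and a page whose layout has drifted may yield no match at all. A minimal sketch of one way to handle both cases follows. The procedure name first-match-text and its default argument are hypothetical, not part of WebScraperHelper or SXPath, and stub-query merely stands in for a compiled query so that the block runs without the SXPath library.

```scheme
;; Hypothetical helper; not part of WebScraperHelper or SXPath.
;; QUERY is a procedure from a document to a list of matched nodes,
;; such as one returned by sxpath.  Returns the first string found
;; directly under the first match, or DEFAULT if there is none.
(define (first-match-text query doc default)
  (let ((matches (query doc)))
    (cond ((null? matches) default)
          ((string? (car matches)) (car matches))
          ((pair? (car matches))
           (let loop ((kids (cdr (car matches))))
             (cond ((null? kids) default)
                   ((string? (car kids)) (car kids))
                   (else (loop (cdr kids))))))
          (else default))))

;; Stand-in for a compiled query, so this block runs on its own.
(define (stub-query doc) '((td "Four")))

(first-match-text stub-query '() "n/a")            ; ⇒ "Four"
(first-match-text (lambda (doc) '()) '() "n/a")    ; ⇒ "n/a"
```

With the query defined above, the corresponding call would be (first-match-text query doc "n/a").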

This version of WebScraperHelper requires R5RS, SRFI-11, and SRFI-16.

WebScraperHelper also comes with an advertising jingle (with apologies to greasy ground bovine additive Americana):

WebScraperHelper
helps a programmer
scrape the
Web a great deal!

See the documentation for more information.

To be added to the moderated scheme-announce email list, ask neil@neilvandyke.org.

The current version of WebScraperHelper is 0.3 (2005-07-04).

You can download the file webscraperhelper.scm (Scheme source code).

You can download the file webscraperhelper.html (documentation in HTML format).

You can download the file webscraperhelper.pdf (documentation in PDF format).

You can download the file webscraperhelper.plt (a package for PLT Scheme).

Site © 1994-2008 Neil Van Dyke   neil@neilvandyke.org