UrlSkip: Web URL Simplification in Scheme

Version 0.1, 3 January 2005, http://www.neilvandyke.org/urlskip/

by Neil W. Van Dyke <neil@neilvandyke.org>

Copyright © 2005 Neil W. Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License [GPL] for details. For other license options and commercial consulting, contact the author.

Introduction

The UrlSkip Scheme library provides a function that translates some of the Web URLs that might be used to track a user across sites, by removing intermediate HTTP redirectors or information that might identify the user. Such a function might be used as part of a privacy-enhancing Web browser, or to canonicalize or un-obfuscate URLs for Web analysis projects.

Note that UrlSkip is not intended to remove information used by “affiliate” referral programs to identify site operators that have sent users to a site. However, in some cases this affiliate ID information might be lost in the process of removing an intermediary URL that is used by a third party to track and profile users.

UrlSkip currently requires R5RS, the [uri.scm] library, and a particular regular expression function. Therefore, UrlSkip currently works only with PLT MzScheme, although it will be made more portable once uri.scm is.

UrlSkip is released under the GPL, unlike most of the author's other released Scheme libraries, which are LGPL.

Host Handlers

The procedures in this section are used internally by the urlskip procedure, and each corresponds to a particular HTTP server hostname. They are exposed here mainly for purposes of documentation, and are likely to change in future versions of UrlSkip. Each procedure accepts a uriobj and yields either a string containing a simpler URL, or #f if no simpler URL was determined.

— Procedure: urlskip-http-ad-doubleclick-net uriobj

UrlSkips http://ad.doubleclick.net:
Substring following ;;~sscs=%3f.

— Procedure: urlskip-http-click-linksynergy-com uriobj

UrlSkips http://click.linksynergy.com:
If path /fs-bin/stat, then query value RD_PARM1 or rd_parm1.

— Procedure: urlskip-http-rds-yahoo-com uriobj

UrlSkips http://rds.yahoo.com:
Substring of the http URL following *-.
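
The "substring following a marker" rule can be illustrated with a minimal sketch in plain R5RS. It assumes the rule amounts to taking everything after the first occurrence of the marker. The names string-index-of and substring-after are hypothetical and are not part of UrlSkip, and the sketch works on a plain string rather than a uriobj. Whether the real handler also percent-decodes the result is not stated here, so the sketch does not.

```scheme
;; Hypothetical helper (not part of UrlSkip): index of the first
;; occurrence of PAT in STR, or #f if PAT does not occur.
(define (string-index-of str pat)
  (let ((slen (string-length str))
        (plen (string-length pat)))
    (let loop ((i 0))
      (cond ((> (+ i plen) slen) #f)
            ((string=? (substring str i (+ i plen)) pat) i)
            (else (loop (+ i 1)))))))

;; Hypothetical helper: the part of STR after the first occurrence of
;; MARKER, or #f if MARKER does not occur.  No percent-decoding is done.
(define (substring-after str marker)
  (let ((i (string-index-of str marker)))
    (and i
         (substring str
                    (+ i (string-length marker))
                    (string-length str)))))
```

For example, (substring-after "http://rds.yahoo.com/S=1/*-http://www.example.com/" "*-") yields "http://www.example.com/", and a string without the marker yields #f.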

— Procedure: urlskip-http-service-netmeans-com uriobj

UrlSkips http://service.netmeans.com:
If path /bfast/click, then query value loc.

— Procedure: urlskip-http-web-ask-com uriobj

UrlSkips http://web.ask.com:
If path /redir, then query value bu.
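
Several handlers in this section share the shape "if the path is P, then yield query value N". The following is a minimal sketch of that shape in plain R5RS, using the web.ask.com rule as the example. It assumes the path and query have already been separated into strings; the names string-split-char, query-value, and sketch-ask-redir are hypothetical and not part of UrlSkip, and the sketch does no percent-decoding.

```scheme
;; Hypothetical helper (not part of UrlSkip): split STR at each
;; occurrence of the character CH, yielding a list of strings.
(define (string-split-char str ch)
  (let loop ((chars (string->list str)) (cur '()) (acc '()))
    (cond ((null? chars)
           (reverse (cons (list->string (reverse cur)) acc)))
          ((char=? (car chars) ch)
           (loop (cdr chars) '() (cons (list->string (reverse cur)) acc)))
          (else
           (loop (cdr chars) (cons (car chars) cur) acc)))))

;; Hypothetical helper: value of the first "NAME=value" pair in QUERY,
;; or #f if there is none.  Names are compared case-sensitively.
(define (query-value query name)
  (let* ((prefix (string-append name "="))
         (plen (string-length prefix)))
    (let loop ((pairs (string-split-char query #\&)))
      (cond ((null? pairs) #f)
            ((and (>= (string-length (car pairs)) plen)
                  (string=? (substring (car pairs) 0 plen) prefix))
             (substring (car pairs) plen (string-length (car pairs))))
            (else (loop (cdr pairs)))))))

;; Sketch of the web.ask.com rule: if the path is /redir, yield the
;; query value bu; otherwise yield #f.
(define (sketch-ask-redir path query)
  (and (string=? path "/redir")
       (query-value query "bu")))
```

For example, (sketch-ask-redir "/redir" "t=an&bu=http://www.example.com/") yields "http://www.example.com/", while any other path yields #f.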

— Procedure: urlskip-http-www-amazon-com uriobj

UrlSkips http://www.amazon.com:
If path /exec/obidos/redirect, then remove all query values except for tag and path.
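
Unlike the other handlers, this rule keeps the original host and trims the query. A minimal sketch of the trimming step in plain R5RS follows. It assumes the query has already been parsed into an association list of name and value strings; the names filter-query-alist and alist->query are hypothetical and not part of UrlSkip, and the parameter names session and ref in the usage note are made up for illustration.

```scheme
;; Hypothetical helper (not part of UrlSkip): keep only the pairs of
;; ALIST whose name is in KEEP-NAMES, preserving their order.
(define (filter-query-alist alist keep-names)
  (cond ((null? alist) '())
        ((member (car (car alist)) keep-names)
         (cons (car alist)
               (filter-query-alist (cdr alist) keep-names)))
        (else (filter-query-alist (cdr alist) keep-names))))

;; Hypothetical helper: serialize (("a" . "1") ("b" . "2")) as "a=1&b=2".
;; Values are written as-is, with no percent-encoding.
(define (alist->query alist)
  (if (null? alist)
      ""
      (let loop ((rest (cdr alist))
                 (out (string-append (car (car alist))
                                     "="
                                     (cdr (car alist)))))
        (if (null? rest)
            out
            (loop (cdr rest)
                  (string-append out "&"
                                 (car (car rest))
                                 "="
                                 (cdr (car rest))))))))
```

For example, filtering '(("session" . "abc") ("tag" . "example-20") ("ref" . "xyz") ("path" . "some/page")) with the keep list '("tag" "path") and passing the result to alist->query yields "tag=example-20&path=some/page".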

— Procedure: urlskip-http-www-anrdoezrs-net uriobj

UrlSkips http://www.anrdoezrs.net:
Query value url.

— Procedure: urlskip-http-www-commission-junction-com uriobj

UrlSkips http://www.commission-junction.com:
If path /track/track.dll, then query value URL.

— Procedure: urlskip-http-www-google-com uriobj

UrlSkips http://www.google.com:
If path /pagead/iclk, then query value adurl.
If path /url, then query value q.
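
When one host has more than one rule, the path selects which query parameter holds the destination. As a minimal sketch in plain R5RS, the two rules above can be written as a lookup table. The names sketch-google-rules and sketch-google-param are hypothetical and not part of UrlSkip; how the real handler is organized is not stated here.

```scheme
;; Hypothetical rule table (not part of UrlSkip): path -> name of the
;; query parameter that carries the destination URL.
(define sketch-google-rules
  '(("/pagead/iclk" . "adurl")
    ("/url" . "q")))

;; Yields the parameter name to read for PATH, or #f if no rule applies.
(define (sketch-google-param path)
  (let ((rule (assoc path sketch-google-rules)))
    (and rule (cdr rule))))
```

For example, (sketch-google-param "/url") yields "q", and a path with no rule, such as "/search", yields #f.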

— Procedure: urlskip-http-www-qksrv-net uriobj

UrlSkips http://www.qksrv.net:
Query value loc or url.
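
Rules of the form "query value A or B" can be sketched in plain R5RS by parsing the query into an association list and trying each name in turn. The names make-query-pair, query->alist, and first-query-value are hypothetical and not part of UrlSkip. Trying the names in the listed order is an assumption, since the order of preference is not stated here, and the sketch does no percent-decoding.

```scheme
;; Hypothetical helper (not part of UrlSkip): build a (name . value)
;; pair from two reversed character lists.
(define (make-query-pair name-chars value-chars)
  (cons (list->string (reverse name-chars))
        (list->string (reverse value-chars))))

;; Hypothetical helper: parse "a=1&b=2" into (("a" . "1") ("b" . "2")).
;; Only the first "=" in each pair separates the name from the value.
(define (query->alist query)
  (let loop ((chars (string->list query))
             (name '()) (value '()) (in-value? #f) (acc '()))
    (cond ((null? chars)
           (reverse (cons (make-query-pair name value) acc)))
          ((char=? (car chars) #\&)
           (loop (cdr chars) '() '() #f
                 (cons (make-query-pair name value) acc)))
          ((and (not in-value?) (char=? (car chars) #\=))
           (loop (cdr chars) name value #t acc))
          (in-value?
           (loop (cdr chars) name (cons (car chars) value) #t acc))
          (else
           (loop (cdr chars) (cons (car chars) name) value #f acc)))))

;; Yields the value of the first name in NAMES that occurs in QUERY,
;; or #f if none of them occurs.
(define (first-query-value query names)
  (let ((alist (query->alist query)))
    (let loop ((names names))
      (cond ((null? names) #f)
            ((assoc (car names) alist) => cdr)
            (else (loop (cdr names)))))))
```

For example, (first-query-value "id=123&url=http://www.example.com/" '("loc" "url")) yields "http://www.example.com/".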

Interface

The only real library interface is the urlskip procedure.

— Procedure: urlskip uri

Accepts a URL uri and yields a URL that is either uri itself or a UrlSkip-simplified version of it. uri may be a string or a uriobj. If a simplified URL is yielded, it is always a string.
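
The relationship between urlskip and the host handlers can be pictured as a table lookup with a fallback: find the handler for the hostname, use its result if it yields a string, and otherwise yield the original URL. The following plain R5RS sketch illustrates only that contract. The names sketch-urlskip and toy-handlers are hypothetical, the hostname is passed in separately instead of being taken from a uriobj, and the real urlskip may be organized differently.

```scheme
;; Sketch of the dispatch-with-fallback contract (not UrlSkip's code).
;; HANDLERS is an association list of (hostname . procedure); each
;; procedure takes the URL string and yields a simpler URL string or #f.
(define (sketch-urlskip host url handlers)
  (let ((entry (assoc host handlers)))
    (or (and entry ((cdr entry) url))
        url)))

;; Toy handlers for illustration only; the hostnames are made up.
(define toy-handlers
  (list (cons "redirector.example"
              (lambda (url) "http://www.example.com/"))
        (cons "plain.example"
              (lambda (url) #f))))
```

With these toy handlers, a URL on redirector.example yields "http://www.example.com/", while a URL on plain.example, or on a host with no handler, is yielded unchanged.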

Tests

The UrlSkip test suite can be enabled by editing the source code file and loading [Testeez]; the test suite is disabled by default.

History

Version 0.1 — 3 January 2005
Initial release.

References

[GPL]
Free Software Foundation, “GNU General Public License,” Version 2, June 1991, 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
http://www.gnu.org/copyleft/gpl.html
[Testeez]
Neil W. Van Dyke, “Testeez: Simple Test Mechanism for Scheme,” Version 0.1.
http://www.neilvandyke.org/testeez/
[uri.scm]
Neil W. Van Dyke, “uri.scm: Web Uniform Resource Identifiers (URI) in Scheme,” Version 0.1.
http://www.neilvandyke.org/uri-scm/