soundex.scm - Soundex Index Keying in Scheme

Version 0.2, 2 August 2004, http://www.neilvandyke.org/soundex-scm/

by Neil W. Van Dyke <neil@neilvandyke.org>

Copyright © 2004 Neil W. Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU Lesser General Public License [LGPL] for more details.

1 Introduction

This is an implementation in Scheme of the Soundex indexing hash function, as specified somewhat loosely by the US National Archives and Records Administration (NARA) publication [Soundex] and verified empirically against test cases from various sources. Both the current NARA function and the older version, which handles `H' and `W' differently, are supported. Additionally, a nonstandard prefix-guessing function permits multiple Soundex keys to be generated from a single string, increasing recall.

This library should work under any R5RS Scheme implementation for which char->integer yields ASCII values.

2 Characters, Ordinals, and Codes

To facilitate possible future support of other input character sets, this library represents the letters used by Soundex as abstract character ordinals. An ordinal value is an integer from 0 to 25, corresponding to the 26 letters `A' through `Z' respectively, and can be used for fast mapping via vectors. Most applications need not be aware of this.

soundex-ordinal chr Procedure

Yields the Soundex ordinal value of character chr, or #f if the character is not considered a letter.

          (soundex-ordinal #\a) => 0
          (soundex-ordinal #\A) => 0
          (soundex-ordinal #\Z) => 25
          (soundex-ordinal #\3) => #f
          (soundex-ordinal #\.) => #f
          

soundex-ordinal->char ord Procedure

Yields the upper-case letter character that corresponds to the character ordinal value ord. For example:

          (soundex-ordinal->char (soundex-ordinal #\a)) => #\A
          

Note that the #f value that soundex-ordinal yields for a non-letter is not an ordinal value, and soundex-ordinal->char does not map it to a character. For example:

          (soundex-ordinal->char (soundex-ordinal #\')) error-->
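
As an illustration of the ordinal mapping described above, here is a minimal sketch and not the library's actual source. The names my-soundex-ordinal and my-soundex-ordinal->char are hypothetical, and the code assumes that char->integer yields ASCII values.

```scheme
;; Hypothetical stand-ins for soundex-ordinal and soundex-ordinal->char.
;; Assumes char->integer yields ASCII values (A = 65, a = 97).

;; Ordinal 0..25 for a letter of either case, or #f for a non-letter.
(define (my-soundex-ordinal chr)
  (let ((n (char->integer chr)))
    (cond ((and (>= n 65) (<= n 90))  (- n 65))   ; #\A .. #\Z
          ((and (>= n 97) (<= n 122)) (- n 97))   ; #\a .. #\z
          (else #f))))

;; Upper-case letter for an ordinal. No check is made that ord is
;; actually an integer from 0 to 25.
(define (my-soundex-ordinal->char ord)
  (integer->char (+ ord 65)))
```

Because a non-letter yields #f and not an integer, passing that result straight to my-soundex-ordinal->char fails in the addition, which parallels the error case shown above.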
          

soundex-ordinal->soundex-code ord Procedure

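Yields a library-specific Soundex code for character ordinal ord. As the examples show, letters that Soundex encodes yield a digit character, while the remaining letters yield a value naming their class (aeiou, hw, or y).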

          (soundex-ordinal->soundex-code (soundex-ordinal #\a)) => aeiou
          (soundex-ordinal->soundex-code (soundex-ordinal #\c)) => #\2
          (soundex-ordinal->soundex-code (soundex-ordinal #\N)) => #\5
          (soundex-ordinal->soundex-code (soundex-ordinal #\w)) => hw
          (soundex-ordinal->soundex-code (soundex-ordinal #\y)) => y
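
The ordinal-to-code mapping lends itself to the vector lookup mentioned earlier. The following is a sketch under assumptions: the names my-code-table and my-ordinal->code are hypothetical, the digit assignments are the standard Soundex letter groups, and representing the uncoded classes as symbols is inferred from the examples above and is not confirmed by the library source.

```scheme
;; Hypothetical code table indexed by letter ordinal (A = 0 ... Z = 25).
;; Digits are characters; the uncoded classes are assumed to be symbols.
(define my-code-table
  (vector 'aeiou #\1 #\2 #\3 'aeiou #\1 #\2 'hw 'aeiou #\2 #\2 #\4 #\5   ; A-M
          #\5 'aeiou #\1 #\2 #\6 #\2 #\3 'aeiou #\1 'hw #\2 'y #\2))     ; N-Z

;; Constant-time lookup of the code for an ordinal from 0 to 25.
(define (my-ordinal->code ord)
  (vector-ref my-code-table ord))

;; (my-ordinal->code 2)  yields #\2   for C
;; (my-ordinal->code 13) yields #\5   for N
;; (my-ordinal->code 22) yields hw    for W
```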
          

char->soundex-code chr Procedure

Yields a library-specific Soundex code for character chr. This is equivalent to: (soundex-ordinal->soundex-code (soundex-ordinal chr)).

3 Hashing

Soundex hashes of strings can be generated with soundex-nara, soundex-old, and soundex.

soundex/narahw/start str narahw? start Procedure

This is an internal procedure.

          (soundex/narahw/start "van Dam" #t 4) => "D500"
          (soundex/narahw/start ".0,!"    #t 0) => #f
          

soundex-nara str Procedure
soundex-old str Procedure
soundex str Procedure

Yields a Soundex hash key of string str, or #f if not even an initial letter could be found. soundex-nara generates NARA hashes, and soundex-old generates older-style hashes. soundex is an alias for soundex-nara.

          (soundex-nara "Ashcraft") => "A261"
          (soundex-old  "Ashcraft") => "A226"
          (soundex      "Ashcraft") => "A261"
          (soundex      "")         => #f
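
The difference between the two hash styles can be sketched as follows. This is an illustrative reimplementation and not the library's source: all my- names are hypothetical, char->integer is assumed to yield ASCII values, and the sketch assumes that non-letters are skipped and that `Y' separates codes the same way a vowel does. With narahw? true, `H' and `W' are transparent, so equal codes on either side of them collapse into one; with narahw? false, they separate codes as vowels do.

```scheme
;; Hypothetical sketch of Soundex hashing with a NARA/old-style switch.

;; Code classes indexed by letter ordinal (A = 0 ... Z = 25).
(define my-codes
  (vector 'aeiou #\1 #\2 #\3 'aeiou #\1 #\2 'hw 'aeiou #\2 #\2 #\4 #\5
          #\5 'aeiou #\1 #\2 #\6 #\2 #\3 'aeiou #\1 'hw #\2 'y #\2))

;; Letter ordinal of chr, or #f for a non-letter (ASCII assumed).
(define (my-ordinal chr)
  (let ((n (char->integer chr)))
    (cond ((and (>= n 65) (<= n 90))  (- n 65))
          ((and (>= n 97) (<= n 122)) (- n 97))
          (else #f))))

;; Append digit codes until the key has four characters, padding with
;; zeros at end of input. prev is the code class of the last significant
;; letter; acc holds the key characters in reverse order.
(define (my-encode str i len narahw? prev acc)
  (cond ((= (length acc) 4) (list->string (reverse acc)))
        ((>= i len) (my-encode str i len narahw? prev (cons #\0 acc)))
        (else
         (let* ((ord  (my-ordinal (string-ref str i)))
                (code (and ord (vector-ref my-codes ord))))
           (cond ((not code)                     ; non-letter: skip
                  (my-encode str (+ i 1) len narahw? prev acc))
                 ((char? code)
                  (if (eqv? code prev)           ; repeat of the same code
                      (my-encode str (+ i 1) len narahw? prev acc)
                      (my-encode str (+ i 1) len narahw? code
                                 (cons code acc))))
                 ((and narahw? (eq? code 'hw))   ; NARA: H and W are transparent
                  (my-encode str (+ i 1) len narahw? prev acc))
                 (else                           ; vowel, Y, or old-style H/W
                  (my-encode str (+ i 1) len narahw? code acc)))))))

;; Key for str, or #f if str contains no letter at all.
(define (my-soundex str narahw?)
  (let ((len (string-length str)))
    (let find ((i 0))
      (cond ((>= i len) #f)
            ((my-ordinal (string-ref str i))
             => (lambda (ord)
                  (my-encode str (+ i 1) len narahw?
                             (vector-ref my-codes ord)
                             (list (integer->char (+ ord 65))))))
            (else (find (+ i 1)))))))

;; (my-soundex "Ashcraft" #t) yields "A261"
;; (my-soundex "Ashcraft" #f) yields "A226"
```

In "Ashcraft", the `S' and `C' share code 2 and are separated only by `H', which is why the two styles disagree on this name.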
          

4 Prefixing

Multiple Soundex hashes from a single string can be generated by soundex-nara/prefixing, soundex-old/prefixing, and soundex/p, which consider the string with and without various common surname prefixes.

soundex-prefix-starts str Procedure

Yields a list of Soundex start points in string str, as character index integers, for making hash keys with and without prefixes. A prefix must be followed by at least two letters, although they can be interspersed with non-letter characters. The exact behavior of this function is subject to change in future versions of this library.

          (soundex-prefix-starts "Smith")          => (0)
          (soundex-prefix-starts "  Jones")        => (2)
          (soundex-prefix-starts "vanderlinden")   => (0 3 6)
          (soundex-prefix-starts "van der linden") => (0 3 7)
          (soundex-prefix-starts "")               => ()
          (soundex-prefix-starts "123")            => ()
          (soundex-prefix-starts "dea")            => (0)
          (soundex-prefix-starts "dea ")           => (0)
          (soundex-prefix-starts "dean")           => (0)
          (soundex-prefix-starts "delasol")        => (0 2 3 4)
          

soundex/narahw str narahw? Procedure

This is an internal procedure.

soundex-nara/prefixing str Procedure
soundex-old/prefixing str Procedure
soundex/p str Procedure

Yields a list of zero or more Soundex hash keys from string str, based on the whole string and the string with various prefixes skipped. All elements of the list are mutually unique. soundex-nara/prefixing generates NARA hashes, and soundex-old/prefixing generates older-style hashes. soundex/p is an alias for soundex-nara/prefixing.

          (soundex/p "Van Damme") => ("V535" "D500")
          (soundex/p "vanvoom")   => ("V515" "V500")
          (soundex/p "vanvanvan") => ("V515")
          (soundex/p "DeLaSol")   => ("D424" "L240" "A240" "S400")
          (soundex/p "")          => ()
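
One way the uniqueness guarantee could be met is to filter the keys computed for each start point, as in the sketch below. The helper my-unique-keys is hypothetical and is not part of the library; it assumes the input is a list of key strings in which #f marks a start point that produced no key.

```scheme
;; Hypothetical helper: drop #f entries and repeated keys, keeping the
;; first occurrence of each key in its original order.
(define (my-unique-keys keys)
  (let loop ((keys keys) (seen '()))
    (cond ((null? keys) (reverse seen))
          ((or (not (car keys)) (member (car keys) seen))
           (loop (cdr keys) seen))
          (else (loop (cdr keys) (cons (car keys) seen))))))

;; (my-unique-keys '("V515" "V515" "V515")) yields ("V515")
;; (my-unique-keys '("V535" "D500"))        yields ("V535" "D500")
```

The linear member search is adequate here because a single name yields only a handful of candidate keys.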
          

History


Version 0.2 -- 2 August 2004
Minor documentation change. Version frozen for PLaneT packaging.
Version 0.1 -- 10 May 2004
First release.

References


[GIL-55]
US National Archives and Records Administration, "Using the Census Soundex," General Information Leaflet 55, 1995.
[LGPL]
Free Software Foundation, "GNU Lesser General Public License," Version 2.1, February 1999, 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
http://www.gnu.org/copyleft/lesser.html
[Soundex]
US National Archives and Records Administration, "The Soundex Indexing System," 19 February 2000.
http://www.archives.gov/research_room/genealogy/census/soundex.html