Chat Threader

Internet Relay Chat (IRC) discourse typically comprises a complex structure of multiple simultaneous threads of conversation that frequently branch. Chat Threader is an experimental software system that attempts to identify the thread structure without natural language understanding.

This capability has applications in software agents that inhabit the on-line chat world, and in user interfaces that provide structure visualizations and thread-based filtering.

In the example of actual IRC discourse below, Foo begins talking about his keyboard, Bar asks a question, Foo answers in multiple messages and says something more about the keyboard, Bar digresses and then asks Foo another question, and Foo starts a largely new thread. This is actually a relatively simple example, as there is only one dominant thread.

00:18:24 <Foo> i got myself a 191 programmable keyboard today.
               its quite impressive.  i can program any key to
               contain up to 52 scan codes of 'regular' keys.
00:18:47 <Bar> does it have additional function keys?
00:18:52 <Foo> oh yeah.
00:19:01 <Foo> keys comming out of its wazooo
00:19:13 <Foo> has a built in 3 track card reader too.
00:19:20 <Bar> when i finally get an nt box, i'd like to have
               window management keys like the sun keyboards
00:19:30 <Nuk> install linux
00:19:30 <Bar> 3-track?
00:19:43 <Nuk> or any unix, not win nt
00:19:49 <Foo> bar: it will read all three tracks of any card.
               ie: credit card, phone card ect
00:20:00 <Bar> i'm a unix weenie, but i figured it was time to
               try nt
00:20:23 <Nuk> weenie ?
00:20:34 <Nuk> y go from something good to something bad
00:21:22 <Foo> but this 3d graphics engine i witnessed today
               was amazing.

The current Chat Threader implementation is written in Emacs Lisp (long story) and assembles the input stream of IRC events into a set of tree structures representing threads of evolving discourse focus. It uses a variety of techniques for matching events to their parents:

  • Parses input into event objects, presently focusing on normal and action messages (channel PRIVMSG, and channel CTCP ACTION PRIVMSG).

  • Identifies some message kinds, such as boolean-question, why-question, other-question, unknown-question, and boolean-answer.

  • Identifies some addresses of public messages to specific users (e.g., "<Foo> nova: i got it for $42" and "<Bar> What's the frequency, Kenneth?").

  • Generates many of the likely abbreviations of nicks for better matching of addresses (e.g., nick MrTofu_3 has case-insensitive abbreviations mrtofu, tofu, mt, mr, mrt, mt3, and mrtofu3).

  • Restricts matching of messages to a window of at most c events and t elapsed time (both currently magic numbers), to allow for human attention span and typical display size under varying traffic levels.

  • Identifies keywords (excluding closed-class words) and looks for identical and related words in other messages. The WordNet lexical database is used for related words.

  • Translates the words, abbreviations, and stylistic word-manglings of many on-line dialects (e.g., Y U W4n+ 2b Tr4NzL8inG uR 3733+ D14L4k7z D00DZ!?!!?11?*%#???) to more canonical English, for easier processing.
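As a rough illustration of the nick-abbreviation and address-identification techniques listed above, here is a minimal sketch in Python (not the Emacs Lisp of the actual implementation). The function names, the splitting rules, and the two address patterns are assumptions made for this example; the real Chat Threader heuristics may well differ.

```python
import re

def nick_abbreviations(nick):
    """Return lowercase abbreviations of an IRC nick (hypothetical heuristic)."""
    # Split into capitalized words, lowercase runs, uppercase runs, digit runs.
    parts = re.findall(r"[A-Z][a-z]+|[a-z]+|[A-Z]+|\d+", nick)
    words = [p.lower() for p in parts if p.isalpha()]
    digits = "".join(p for p in parts if p.isdigit())
    if not words:
        return set()
    joined = "".join(words)
    initials = "".join(w[0] for w in words)
    abbrevs = set(words) | {joined}
    if len(words) > 1:
        abbrevs.add(initials)
        # Leading words spelled out, remaining words reduced to initials.
        for k in range(1, len(words)):
            abbrevs.add("".join(words[:k]) + "".join(w[0] for w in words[k:]))
    if digits:
        abbrevs.add(joined + digits)
        if len(words) > 1:
            abbrevs.add(initials + digits)
    # The full nick itself is not an abbreviation.
    abbrevs.discard(nick.lower())
    return abbrevs

def find_addressee(text, nicks):
    """Return the nick a public message seems addressed to, or None."""
    # Map every full nick and abbreviation (lowercased) back to its nick.
    names = {}
    for nick in nicks:
        for name in {nick.lower()} | nick_abbreviations(nick):
            names.setdefault(name, nick)
    # Assumed pattern 1: leading address, as in "nova: i got it for $42".
    lead = re.match(r"\s*([^\s:,]+)[:,]\s", text)
    # Assumed pattern 2: trailing vocative, as in "..., Kenneth?".
    trail = re.search(r",\s*([^\s,?!.]+)[?!.]*\s*$", text)
    for match in (lead, trail):
        if match and match.group(1).lower() in names:
            return names[match.group(1).lower()]
    return None

print(sorted(nick_abbreviations("MrTofu_3")))
# → ['mr', 'mrt', 'mrtofu', 'mrtofu3', 'mt', 'mt3', 'tofu']
print(find_addressee("bar: it will read all three tracks of any card.",
                     ["Foo", "Bar", "Nuk"]))
# → Bar
```

Short abbreviations such as "mr" or "mt" can collide with ordinary words, so a fuller version would probably weigh them against other evidence rather than accept them outright.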
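The window restriction and keyword matching described in the list can be sketched together as a simple parent finder. This is a Python illustration under stated assumptions, not the actual Emacs Lisp code: the Event class, the find_parent function, the tiny closed-class word list, and the default values for c (max_events) and t (max_seconds) are all hypothetical, and the WordNet lookup for related words is left out, so only identical words count.

```python
import re
from dataclasses import dataclass

# Tiny illustrative closed-class word list; a real one would be much longer.
CLOSED_CLASS = {
    "a", "an", "the", "i", "it", "its", "you", "to", "of", "and", "or",
    "is", "was", "does", "have", "has", "too", "in", "on", "this", "that",
    "but", "any", "not", "will", "all", "when", "like",
}

@dataclass
class Event:
    seconds: int  # time of day, in seconds since midnight
    nick: str
    text: str

def keywords(text):
    """Lowercased word tokens of a message, minus closed-class words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in CLOSED_CLASS}

def find_parent(history, event, max_events=10, max_seconds=120):
    """Return the in-window event sharing the most keywords, or None.

    history must be in chronological order. max_events and max_seconds
    stand in for the c and t window limits; the defaults are made up.
    """
    wanted = keywords(event.text)
    best, best_score = None, 0
    # Walk newest-first so that ties go to the most recent candidate.
    for candidate in reversed(history[-max_events:]):
        if event.seconds - candidate.seconds > max_seconds:
            break  # everything older is outside the time window too
        score = len(wanted & keywords(candidate.text))
        if score > best_score:
            best, best_score = candidate, score
    return best

history = [
    Event(1153, "Foo", "has a built in 3 track card reader too."),
    Event(1160, "Bar", "when i finally get an nt box, i'd like to have "
                       "window management keys like the sun keyboards"),
    Event(1170, "Nuk", "install linux"),
]
parent = find_parent(history, Event(1170, "Bar", "3-track?"))
print(parent.text)
# → has a built in 3 track card reader too.
```

Here Bar's "3-track?" is attached to Foo's card reader message because they share the tokens "3" and "track", while the two intervening messages share nothing with it. A message with no keyword overlap inside the window gets no parent from this step and would have to be placed by the other techniques.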

Chat Threader was my final project for a class in 1997. The initial approach proved moderately successful in limited evaluation. Examination of the results suggested refining the approach to use a probabilistic model. My class paper also identified several other ideas for further work. I would be interested in sharing the program and data with someone who is serious about pursuing this line of work.

© Copyright Neil Van Dyke