SpamBayes knows spam
Outlook add-in really works to block spam, and it's freeFollow @infoworld
Thomas Bayes, a Presbyterian minister and mathematician born just over 300 years ago, would be shocked to see most of the e-mail messages that bid for our attention nowadays. He would be thrilled to know, however, that his statistical inference theorem has inspired a potent counterattack. An open source project called SpamBayes has emerged as a powerful weapon in the war on spam. There are a few different implementations of SpamBayes. I'll focus here on an Outlook add-in, written by renowned Python hacker Mark Hammond. I've been skeptical about the long-term prospects for content-based e-mail filtering. But the Python-based SpamBayes engine, and Hammond's brilliant add-in (also written in Python), are rapidly making me a believer.
Several e-mail programs, including the Mail program bundled with Mac OS X, use Bayesian techniques to enable users to train their systems to distinguish between spam and nonspam (aka ham). Experts debate how the term Bayesian is relevant to this game of classification, but the core ideas in Paul Graham's influential 2001 paper, "A Plan for Spam," make sense intuitively. Every message bears evidence both for and against the hypothesis that it is spam. Your disposition of every message tests both hypotheses and systematically improves the filter's ability to separate spam from ham.
As Graham pointed out, the judgments involved are highly individual. For example, the commercial e-mail that I want to receive (or reject) will differ from the ones you want (and don't want) according to our interests and tastes. A filter that works on behalf of a large group, such as SpamAssassin, which checks and often rewrites my infoworld.com mail, or CloudMark's SpamNet (formerly Vipul's Razor), which collaboratively builds a database of spam signatures, will typically agree with SpamBayes on what I call the Supreme Court definition of spam: You know it when you see it. What sets SpamBayes apart is its ability to learn, by observing your behavior, which messages you do want to see, and the ones you don't.
If you use Outlook 2000 or Outlook XP, it's easy -- and free -- to give the SpamBayes Outlook add-in a whirl. If you already have Python installed, you can acquire the source and set up SpamBayes and the add-in according to the usual conventions for open source packages. I did that, but because I'm well aware that typical Outlook users don't have Python installed and won't want to deal with an open source-style installation, I also tested the binary installer available at Starship Python. It worked beautifully, installing SpamBayes plus the subset of the Python needed to run it.
SpamBayes appears as a toolbar item called Anti-Spam. To use the add-in effectively, you'll need to point it to a pile of ham. These messages may simply be the contents of your inbox if you keep it squeaky clean. But they can also live in other folders. That's great news, because I use Outlook's filters aggressively to route messages from known correspondents to folders.