
InfoWorld

SpamBayes knows spam

Outlook add-in really works to block spam, and it's free

By Jon Udell  
May 16, 2003
 

Thomas Bayes, a Presbyterian minister and mathematician born just over 300 years ago, would be shocked to see most of the e-mail messages that bid for our attention nowadays. He would be thrilled to know, however, that his statistical inference theorem has inspired a potent counterattack. An open source project called SpamBayes has emerged as a powerful weapon in the war on spam. There are a few different implementations of SpamBayes. I'll focus here on an Outlook add-in, written by renowned Python hacker Mark Hammond. I've been skeptical about the long-term prospects for content-based e-mail filtering. But the Python-based SpamBayes engine, and Hammond's brilliant add-in (also written in Python), are rapidly making me a believer.




SpamBayes for Outlook

SpamBayes Project, sourceforge.net/

Excellent  9.4
Criteria        Score   Weight
Setup           9       5%
Documentation   8       5%
Effectiveness   9       45%
Value           10      45%

Cost:
Free download

Platforms:
For Outlook add-in: Windows and Outlook 2000 and XP; for SpamBayes engine: Python

Bottom Line:
This powerful anti-spam weapon works with Microsoft Outlook filters and folders, trains on your own unique message database, and learns by watching you, responding to both positive and negative clues. Most important, it’s immediately effective.


Several e-mail programs, including the Mail program bundled with Mac OS X, use Bayesian techniques to enable users to train their systems to distinguish between spam and nonspam (aka ham). Experts debate how the term Bayesian is relevant to this game of classification, but the core ideas in Paul Graham's influential 2002 paper, "A Plan for Spam," make sense intuitively. Every message bears evidence both for and against the hypothesis that it is spam. Your disposition of every message tests both hypotheses and systematically improves the filter's ability to separate spam from ham.
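
The intuition can be sketched in a few lines of Python. This is a minimal illustration of the token-evidence idea in Graham's essay, not SpamBayes' actual code; the function names, the toy training piles, and the clamping constants (0.01 and 0.99) are assumptions made for the example.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def train(messages):
    """Count, per token, how many messages in the pile contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(set(tokenize(msg)))
    return counts

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token), clamped so no single token is decisive."""
    s = spam_counts[token] / n_spam
    h = ham_counts[token] / n_ham
    if s + h == 0:
        return 0.5  # never-seen token: no evidence either way (illustrative)
    return min(0.99, max(0.01, s / (s + h)))

def spam_score(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-token evidence for and against spam into one probability."""
    p_spam = p_ham = 1.0
    for tok in set(tokenize(text)):
        p = token_spam_prob(tok, spam_counts, ham_counts, n_spam, n_ham)
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# Toy piles standing in for a real spam folder and a real ham folder.
spam_pile = ["cheap pills buy now", "buy cheap watches now"]
ham_pile = ["python meeting notes attached", "lunch meeting tomorrow"]
spam_counts, ham_counts = train(spam_pile), train(ham_pile)

print(spam_score("buy cheap pills", spam_counts, ham_counts, 2, 2))  # close to 1
print(spam_score("meeting notes", spam_counts, ham_counts, 2, 2))    # close to 0
```

Because the counts come from your own folders, the same message can score differently for two different people, which is the point about individual judgment.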

As Graham pointed out, the judgments involved are highly individual. The commercial e-mail that I want to receive (or reject) will differ from what you want (or don't want), according to our interests and tastes. A filter that works on behalf of a large group will typically agree with SpamBayes on what I call the Supreme Court definition of spam: You know it when you see it. Examples include SpamAssassin, which checks and often rewrites my infoworld.com mail, and Cloudmark's SpamNet (formerly Vipul's Razor), which collaboratively builds a database of spam signatures. What sets SpamBayes apart is its ability to learn, by observing your behavior, which messages you want to see and which you don't.

Arming Outlook

If you use Outlook 2000 or Outlook XP, it's easy -- and free -- to give the SpamBayes Outlook add-in a whirl. If you already have Python installed, you can acquire the source and set up SpamBayes and the add-in according to the usual conventions for open source packages. I did that, but because I'm well aware that typical Outlook users don't have Python installed and won't want to deal with an open source-style installation, I also tested the binary installer available at Starship Python. It worked beautifully, installing SpamBayes plus the subset of Python needed to run it.

SpamBayes appears as a toolbar item called Anti-Spam. To use the add-in effectively, you'll need to point it to a pile of ham. These messages may simply be the contents of your inbox if you keep it squeaky clean. But they can also live in other folders. That's great news, because I use Outlook's filters aggressively to route messages from known correspondents to folders.

You'll also need to point SpamBayes to a big pile of spam. In my case, that folder was called NotToMe, where an Outlook filter has long been accumulating messages that are neither To: nor CC: my primary e-mail addresses. This simple rule is so effective at filtering spam that it was my sole defense until InfoWorld installed SpamAssassin a few months ago. But lately, as I'm sure you've noticed, the volume of spam has exploded. Even with SpamAssassin, the hassle of plucking the few wanted messages from my NotToMe folder, plus the growing amount of spam sent to my primary e-mail addresses (and not caught by SpamAssassin), spurred me to take the next step.
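
The NotToMe rule is simple enough to express outside Outlook. Here is a rough Python equivalent using the standard library's email package; the addresses, folder names, and function names are hypothetical stand-ins, and the real rule lives in Outlook's own filter settings.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical addresses; substitute your own primary addresses.
MY_ADDRESSES = {"me@example.com", "me.alias@example.com"}

def addressed_to_me(raw_message, my_addresses=MY_ADDRESSES):
    """Return True if any To: or Cc: recipient is one of my addresses."""
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _name, addr in getaddresses(headers)}
    return bool(recipients & my_addresses)

def route(raw_message):
    """Mimic the rule: mail not addressed to me goes to NotToMe."""
    return "Inbox" if addressed_to_me(raw_message) else "NotToMe"

direct = "From: a@example.org\nTo: Me <me@example.com>\nSubject: hi\n\nHello"
bulk = "From: x@example.net\nTo: list@example.net\nSubject: deal\n\nBuy now"
print(route(direct))  # Inbox
print(route(bulk))    # NotToMe
```

A rule like this knows nothing about content, which is why wanted list mail ends up in the same folder as the junk.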

After you finish training, you designate another folder -- I called mine MaybeSpam -- for dubious messages. This third category is an extra wrinkle added by SpamBayes to the binary spam/ham technique spelled out in Paul Graham's original paper. Messages can present conflicting evidence -- that is, they score high (or low) for both ham and spam. In these cases, SpamBayes asks you for a ruling.
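
A minimal sketch of how a third category can fall out of the math, assuming a chi-squared combining scheme of the kind the SpamBayes project documents. The function names and sample probabilities are illustrative, and this is not the project's code. Spam evidence and ham evidence are measured separately; when both come out high, the final score lands near the middle.

```python
import math

def chi2q(x2, dof):
    """Upper-tail probability of a chi-squared variable with even dof."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(token_probs):
    """Return (spam_evidence, ham_evidence, score), each in [0, 1].

    Each entry of token_probs is P(spam | token), strictly between 0 and 1.
    """
    n = len(token_probs)
    if n == 0:
        return 0.0, 0.0, 0.5
    # Sum logs instead of multiplying, to avoid floating-point underflow.
    ln_not_spam = sum(math.log(1.0 - p) for p in token_probs)
    ln_spam = sum(math.log(p) for p in token_probs)
    spam_evidence = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    score = (spam_evidence - ham_evidence + 1.0) / 2.0
    return spam_evidence, ham_evidence, score

clear_spam = [0.99] * 5                # every clue points the same way
conflicted = [0.99] * 5 + [0.01] * 5   # strong clues on both sides

print(combine(clear_spam))   # high spam evidence, low ham evidence
print(combine(conflicted))   # both high, so the score sits near 0.5
```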

So long, spam

Given this setup, you turn on filtering and observe over time. The add-in runs inbound messages to your inbox (or other designated folders) through the SpamBayes classifier. Then it routes what is certainly spam to the Spam folder, and what might be spam to the MaybeSpam folder. All other messages land in your inbox, or wherever your regular filters normally route them. But every message gets tagged with a user-defined field that stores its "spamminess" percentage. You can add this field to customized Outlook views of your folders, and sort on it -- a useful way to gauge how well you've trained the system.
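
The routing step amounts to comparing each message's score with two cutoffs. The sketch below assumes a score between 0 and 1; the cutoff values, folder names, and function name are hypothetical, chosen only to illustrate the three-way split and the percentage field.

```python
# Hypothetical cutoffs chosen for illustration.
SPAM_CUTOFF = 0.90
UNSURE_CUTOFF = 0.20

def route_message(score, default_folder="Inbox"):
    """Pick a destination folder and a spamminess percentage for a message."""
    spamminess = round(score * 100)
    if score >= SPAM_CUTOFF:
        folder = "Spam"
    elif score >= UNSURE_CUTOFF:
        folder = "MaybeSpam"
    else:
        # Below both cutoffs: leave the message to the regular filters.
        folder = default_folder
    return folder, spamminess

# Sort by score, highest first, the way a view sorted on the field would.
scored = {"msg-1": 0.998, "msg-2": 0.55, "msg-3": 0.02}
for msg_id, score in sorted(scored.items(), key=lambda kv: kv[1], reverse=True):
    print(msg_id, *route_message(score))
```

Sorting on the percentage makes the borderline cases easy to find: they cluster just above and below the cutoffs.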

When a wanted message lands in Spam or MaybeSpam, you use the Recover from Spam button to restore it to its original folder, and train it as a good message. Likewise, you use Delete As Spam to nuke an unwanted message that lands in one of your "good" folders, and train it as a bad message.
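
Conceptually, each button press is one more training example. A rough sketch of that feedback loop, with a hypothetical TrainingDB class and helper names that are not part of SpamBayes:

```python
from collections import Counter

class TrainingDB:
    """Per-token message counts for the ham and spam piles."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.totals = {"ham": 0, "spam": 0}

    def learn(self, tokens, label):
        """Record one message's tokens as evidence for the given label."""
        self.counts[label].update(set(tokens))
        self.totals[label] += 1

def recover_from_spam(db, tokens):
    """A wanted message was misfiled: train it as ham."""
    db.learn(tokens, "ham")

def delete_as_spam(db, tokens):
    """An unwanted message slipped through: train it as spam."""
    db.learn(tokens, "spam")

db = TrainingDB()
recover_from_spam(db, ["invoice", "attached", "thanks"])
delete_as_spam(db, ["cheap", "pills", "thanks"])
print(db.totals)
print(db.counts["ham"]["thanks"], db.counts["spam"]["thanks"])
```

A token that shows up in both piles, like "thanks" here, ends up carrying little weight either way.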

Results, for me, were immediate and spectacular. SpamBayes nailed a number of spams that SpamAssassin let through. SpamAssassin was fooled by a penis enlargement ad in Spanish, for example, while SpamBayes nailed it. But other catches involve subtler discrimination. It appears that SpamBayes really can learn to distinguish between messages about legitimate products and services that I care about, and messages about equally legitimate stuff that doesn't matter to me. Can messages that are merely off-target really be defined as spam? I won't quibble. Life is short. If software can make my computer act like the intelligent assistant it's supposed to be, bring it on.

The real test, I suppose, will be in the months and years to come, as SpamBayes succeeds or fails in adapting to my evolving interests and tastes. Time may also reveal other interesting applications of SpamBayes; see the discussion on my Weblog.

Meanwhile, a minor miracle has occurred. I actually look forward to fetching my e-mail. Scanning the many messages landing in Spam, and marking them as read, is quick because there have so far been no -- I repeat, no -- false positives. I haven't yet delegated ultimate power to my new assistant; I still review its decisions. But my confidence grows daily, and I'm close to routing the crap straight to the bit bucket where it belongs.





 


 
Jon Udell is lead analyst and blogger in chief at the InfoWorld Test Center.
