August 20, 2002

Paul Graham provides stunning answer to spam e-mails

Probability theory shows impressive results

Paul Graham, the co-creator of what is now the Yahoo Store, has published a strikingly effective new method of filtering spam.

I myself have long been critical of anti-spam and "family" filters, most of which are ineffective at best, and brain-dead at worst. One filter I evaluated halted the user's PC whenever the word "bomb" was encountered, supposedly to stop children from using the Internet to learn how to make bombs. Unfortunately, this also stopped students from researching the Unabomber or other legitimate topics. Dumb.

Graham's new method, by contrast, is an intelligent application of probability theory. In its latest iteration, his filter correctly flags 99.5 percent of spam, with 0.0 percent "false positives."

False positives are legitimate messages, often important personal ones, that anti-spam efforts incorrectly filter out. This is a huge problem for ordinary methods. As Graham explains, "For most users, missing legitimate email is an order of magnitude worse than receiving spams, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient."

Like many anti-spam crusaders, Graham himself started with an ordinary filter approach, looking for specific "bad words." This initially showed some promise. Simply filtering out all e-mails that contain the word "click," he says, correctly eliminates 79.7 percent of spam messages, while wrongly trashing only 1.2 percent of legitimate mail.

Those success rates, however, quickly degrade as more and more words are added to the "bad" list. This makes the crude filtering approach unusable.
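A rough way to see why the "bad" list degrades: every word added gets another chance to trash a genuine message. The sketch below is only an illustration, not Graham's calculation. It assumes, hypothetically, that each bad word wrongly matches legitimate mail independently of the others, and it reuses the 1.2 percent figure quoted for "click" as a stand-in rate for every word on the list.

```python
def combined_false_positive_rate(per_word_rates):
    """Chance that a legitimate message trips at least one word on the list.

    Assumes (hypothetically) that the words match independently.
    """
    survives = 1.0
    for rate in per_word_rates:
        survives *= 1.0 - rate  # message must get past every word on the list
    return 1.0 - survives


if __name__ == "__main__":
    for list_size in (1, 5, 10, 20):
        rate = combined_false_positive_rate([0.012] * list_size)
        print(f"{list_size:2d} bad words: {rate:.1%} of legitimate mail trashed")
```

Real words are not independent, so the true numbers will differ, but the direction is the same: a longer list can only raise the share of good mail that gets thrown away.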

The solution, Graham found, was to expand the technique in a statistically sophisticated way. Because all spam is trying to hype something, certain words have a high probability of indicating a spam message. Other words almost never appear in spam.

Words such as "though" and "apparently," for example, increase the probability that a message is legitimate, because spam isn't big on subtlety. At the same time, a genuine message isn't rejected simply because it uses a single instance of a term that might also appear in an adult-oriented spam message.

Instead of mere "dumb" filtering, Graham's elegant method analyzes the 15 "most interesting" words in each message. Through a technique known as Bayesian analysis, the weights of these 15 words are then used to compute the probability that a message is spam. This analysis is where his 99.5 percent accuracy rate comes from.
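For readers who want to see the idea in code, here is a minimal sketch of that scoring step. It is not Graham's program. The word weights are made up for illustration, the function names are hypothetical, and the sketch assumes a "naive" Bayesian combination that treats the words as independent. It also assumes that "most interesting" means "farthest from a neutral 0.5," and that a word the filter has never seen gets a mildly innocent weight of 0.4.

```python
from math import prod


def most_interesting(tokens, weights, n=15, unknown=0.4):
    """Return the n distinct tokens whose weight lies farthest from a neutral 0.5."""
    scored = [(abs(weights.get(t, unknown) - 0.5), t) for t in set(tokens)]
    scored.sort(reverse=True)
    return [t for _, t in scored[:n]]


def spam_probability(tokens, weights, n=15, unknown=0.4):
    """Combine the weights of the most interesting tokens into one probability."""
    probs = []
    for token in most_interesting(tokens, weights, n, unknown):
        p = weights.get(token, unknown)
        probs.append(min(0.99, max(0.01, p)))  # clamp so no single word is decisive
    spam_evidence = prod(probs)
    ham_evidence = prod(1.0 - p for p in probs)
    return spam_evidence / (spam_evidence + ham_evidence)


if __name__ == "__main__":
    # Hypothetical weights: the probability that a message containing the word is spam.
    weights = {"click": 0.99, "free": 0.90, "though": 0.05, "apparently": 0.02}
    print(spam_probability("click here for free stuff".split(), weights))
    print(spam_probability("though apparently you should click".split(), weights))
```

Note how the second message still scores low despite containing "click": the innocent words outweigh the single suspicious one, which is exactly the behavior a list of banned words cannot deliver.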

To get the weights, Graham ran the analysis on 4,000 spam messages and 4,000 legitimate ones. That may not seem like a large sample, but it has proved to be enough to produce the results above.
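How might such weights be derived from two piles of mail? The sketch below is one simple way, assuming a word's weight is its rate of appearance in spam divided by the sum of its rates in spam and in legitimate mail. The tokenizer, the function names and the clamping to the range 0.01 to 0.99 are assumptions made for illustration; Graham's own code may count and adjust things differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word-like tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def count_words(messages):
    """Count every token occurrence across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def train_weights(spam_messages, ham_messages):
    """Estimate, for each word, how strongly it points to spam (0.01 to 0.99)."""
    spam_counts = count_words(spam_messages)
    ham_counts = count_words(ham_messages)
    n_spam = max(len(spam_messages), 1)
    n_ham = max(len(ham_messages), 1)
    weights = {}
    for word in set(spam_counts) | set(ham_counts):
        spam_rate = min(1.0, spam_counts[word] / n_spam)
        ham_rate = min(1.0, ham_counts[word] / n_ham)
        weight = spam_rate / (spam_rate + ham_rate)
        weights[word] = min(0.99, max(0.01, weight))  # never fully certain
    return weights


if __name__ == "__main__":
    spam = ["Click here for free money", "Click now"]
    ham = ["Though I apparently missed the meeting", "Click the link though"]
    for word, weight in sorted(train_weights(spam, ham).items()):
        print(f"{word:12s} {weight:.2f}")
```

With only four toy messages the numbers mean little; the point is that the weights come from counting, not from anyone's guess about which words are "bad."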

Graham proposes that his research be used to create a "seed filter" that would become part of users' e-mail programs. Users would also be equipped with two Delete commands. One would be the regular Delete key, for genuine messages, while the other would be a Delete-As-Spam key, to be used when deleting spam messages. After a short time, each user would have an even more accurate filter, and spammers wouldn't have a single seed file that they could easily figure out a way to work around.
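The two-key idea can be sketched as a small class that updates its word counts every time the user deletes something. Everything here is hypothetical: the class name, the method names and the simple whitespace tokenizer are illustrative assumptions, not part of any real e-mail program or of Graham's proposal.

```python
from collections import Counter


class PersonalFilter:
    """Per-user word statistics, optionally started from shared seed counts."""

    def __init__(self, seed_spam=None, seed_ham=None, n_spam=0, n_ham=0):
        self.spam_counts = Counter(seed_spam or {})
        self.ham_counts = Counter(seed_ham or {})
        self.n_spam = n_spam  # number of spam messages seen so far
        self.n_ham = n_ham    # number of legitimate messages seen so far

    def delete(self, message):
        """Regular Delete: the message was genuine, so its words count as innocent."""
        self.ham_counts.update(message.lower().split())
        self.n_ham += 1

    def delete_as_spam(self, message):
        """Delete-As-Spam: the message's words count as evidence of spam."""
        self.spam_counts.update(message.lower().split())
        self.n_spam += 1

    def weight(self, word, unknown=0.4):
        """Current spam weight for a word; unseen words get an assumed default."""
        s, h = self.spam_counts[word], self.ham_counts[word]
        if s + h == 0:
            return unknown
        spam_rate = min(1.0, s / max(self.n_spam, 1))
        ham_rate = min(1.0, h / max(self.n_ham, 1))
        return min(0.99, max(0.01, spam_rate / (spam_rate + ham_rate)))


if __name__ == "__main__":
    mailbox = PersonalFilter()
    mailbox.delete_as_spam("cheap pills click now")
    mailbox.delete("click the link though")
    for word in ("pills", "click", "though", "unseen"):
        print(word, mailbox.weight(word))
```

Because each user's counts drift in their own direction, two mailboxes that started from the same seed would soon weigh the same word differently, which is what makes a one-size-fits-all workaround harder for spammers.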

©1994-2009 Infoworld, Inc.