Paul Graham, the co-creator of what is now the Yahoo Store, has published a strikingly effective new method of filtering spam.
I myself have long been critical of anti-spam and "family" filters, most of which are ineffective at best and brain-dead at worst. One filter I evaluated stopped the user's PC when the word "bomb" was encountered, supposedly to stop children from using the Internet to learn how to make bombs. Unfortunately, this also stopped students from researching the Unabomber or other legitimate topics. Dumb.
Graham's new method, by contrast, is an intelligent application of probability theory. In his latest iteration, his filter correctly flags 99.5 percent of spam, with 0.0 percent "false positives."
False positives are legitimate messages, often important personal ones, that anti-spam efforts incorrectly filter out. This is a huge problem for ordinary methods. As Graham explains, "For most users, missing legitimate email is an order of magnitude worse than receiving spams, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient."
Like many anti-spam crusaders, Graham himself started with an ordinary filter approach, looking for specific "bad words." This initially showed some promise. Simply filtering out all e-mails that contain the word "click," he says, correctly eliminates 79.7 percent of spam messages, while wrongly trashing only 1.2 percent of legitimate mail.
Those success rates, however, quickly degrade as more and more words are added to the "bad" list, because every added word also trashes some legitimate mail. This makes the crude filtering approach unusable.
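The two numbers that matter for a bad-word filter, the share of spam it catches and the share of legitimate mail it wrongly trashes, are easy to measure. The following is a minimal Python sketch, not Graham's code: the function name `keyword_filter_rates` and the tiny sample messages are made up for illustration, and a real test would use thousands of messages.

```python
def keyword_filter_rates(spam, legit, bad_words):
    """Return (catch_rate, false_positive_rate) for a crude bad-word filter."""
    bad = {w.lower() for w in bad_words}

    def flagged(message):
        # A message is flagged if any of its whitespace-separated tokens
        # appears on the bad-word list.
        return any(tok in bad for tok in message.lower().split())

    catch_rate = sum(flagged(m) for m in spam) / len(spam)
    false_positive_rate = sum(flagged(m) for m in legit) / len(legit)
    return catch_rate, false_positive_rate


# Made-up sample messages, for illustration only.
spam = ["click here now", "click to win", "free offer", "click for deals"]
legit = ["lunch tomorrow", "please click the link I sent",
         "are you free friday", "though apparently late"]

print(keyword_filter_rates(spam, legit, ["click"]))          # (0.75, 0.25)
print(keyword_filter_rates(spam, legit, ["click", "free"]))  # (1.0, 0.5)
```

In this toy data, adding "free" to the list catches the remaining spam but doubles the share of legitimate mail that gets thrown away, which is the trade-off that makes the approach unusable.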
The solution, Graham found, was to expand the technique in a statistically sophisticated way. Because all spam is trying to hype something, certain words have a high probability of indicating a spam message. Other words almost never appear in spam.
Words such as "though" and "apparently," for example, increase the probability that a message is legitimate, because spam isn't big on subtlety. At the same time, a genuine message isn't rejected simply because it uses a single instance of a term that might also appear in an adult-oriented spam message.
Instead of mere "dumb" filtering, Graham's elegant method analyzes the 15 "most interesting" words in each message. Through a technique known as Bayesian analysis, the weights of these 15 words are then used to compute the probability that a message is spam. This analysis is where his 99.5 percent accuracy rate comes from.
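To make the idea concrete, here is a hedged Python sketch of how 15 word weights might be combined into a single spam probability. It uses the standard naive Bayes combination, which is an assumption here rather than a transcription of Graham's code. The function name `spam_probability`, the sample weights, the choice to measure "interesting" as distance from a neutral 0.5, and the 0.4 default for unseen words are all illustrative assumptions.

```python
from math import prod


def spam_probability(tokens, weights, top_n=15, unknown=0.4):
    """Combine per-word spam weights into one probability for a message.

    `weights` maps a word to the estimated probability (strictly between
    0 and 1) that a message containing it is spam. Words missing from
    `weights` get the assumed default `unknown`.
    """
    # Count each distinct word once.
    probs = [weights.get(tok, unknown) for tok in set(tokens)]
    # Assumption: the "most interesting" words are those whose weight
    # lies farthest from the neutral value 0.5, in either direction.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:top_n]
    spam_like = prod(top)
    legit_like = prod(1 - p for p in top)
    return spam_like / (spam_like + legit_like)


# Hypothetical weights, for illustration only.
weights = {"click": 0.9, "free": 0.8, "though": 0.1}

print(spam_probability(["click", "free"], weights))    # about 0.973
print(spam_probability(["click", "though"], weights))  # about 0.5
```

The second call shows why a single suspicious word does not condemn a message: a strongly "legitimate" word such as "though" cancels it out.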
To get the weights, Graham ran the analysis on 4,000 spam messages and 4,000 legitimate ones. That may not seem like a large sample, but it has proved to be enough to yield very significant results.
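One simple way to derive such weights from two piles of mail is sketched below in Python. This is a simplified illustration under stated assumptions, not Graham's exact formula: it compares the fraction of spam messages containing a word with the fraction of legitimate messages containing it, and clamps the result to the range 0.01 to 0.99 so that no single word is treated as absolute proof. The name `train_weights` and the sample messages are hypothetical.

```python
from collections import Counter


def train_weights(spam, legit):
    """Estimate, for each word, the probability that a message containing it is spam."""

    def message_counts(messages):
        # Number of messages in which each word appears at least once.
        return Counter(tok for m in messages for tok in set(m.lower().split()))

    spam_counts = message_counts(spam)
    legit_counts = message_counts(legit)

    weights = {}
    for word in spam_counts.keys() | legit_counts.keys():
        spam_rate = spam_counts[word] / len(spam)
        legit_rate = legit_counts[word] / len(legit)
        p = spam_rate / (spam_rate + legit_rate)
        # Clamp so that a word seen in only one pile never yields 0 or 1.
        weights[word] = min(0.99, max(0.01, p))
    return weights


# Made-up sample messages, for illustration only.
w = train_weights(
    spam=["click here", "click now"],
    legit=["see you though", "click this though"],
)
print(w["click"], w["though"], w["here"])
```

Here "click" appears in all of the spam and half of the legitimate mail, so it leans toward spam without being decisive, while "though" and "here" are pushed to the clamped extremes.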
Graham proposes that his research be used to create a "seed filter" that would become part of users' e-mail programs. Users would also be equipped with two Delete commands. One would be the regular Delete key, for genuine messages, while the other would be a Delete-As-Spam key, to be used when deleting spam messages. After a short time, each user would have an even more accurate filter, tuned to that user's own mail, and spammers wouldn't have a single seed filter that they could easily figure out a way to work around.
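The two Delete commands amount to a filter that keeps learning from each user's own mail. A rough Python sketch of that idea follows. The class name `PersonalFilter`, its method names, and the 0.4 default for unseen words are assumptions made for illustration, and the weighting is the same simplified message-fraction ratio rather than anything Graham specifies.

```python
from collections import Counter


class PersonalFilter:
    """A per-user filter that updates its word counts on every delete."""

    def __init__(self):
        # A shipped seed filter could pre-populate these counts.
        self.spam_counts = Counter()
        self.legit_counts = Counter()
        self.n_spam = 0
        self.n_legit = 0

    def delete(self, message):
        """Regular Delete: record the message's words as legitimate."""
        self.legit_counts.update(set(message.lower().split()))
        self.n_legit += 1

    def delete_as_spam(self, message):
        """Delete-As-Spam: record the message's words as spam."""
        self.spam_counts.update(set(message.lower().split()))
        self.n_spam += 1

    def weight(self, word, unknown=0.4):
        """Current estimate that a message containing `word` is spam."""
        spam_rate = self.spam_counts[word] / self.n_spam if self.n_spam else 0.0
        legit_rate = self.legit_counts[word] / self.n_legit if self.n_legit else 0.0
        if spam_rate + legit_rate == 0:
            return unknown  # word never seen by this user
        return min(0.99, max(0.01, spam_rate / (spam_rate + legit_rate)))


f = PersonalFilter()
f.delete_as_spam("click here to win")
f.delete("click the attached notes though")
print(f.weight("click"), f.weight("win"), f.weight("though"), f.weight("unseen"))
```

Because every user deletes different mail, every copy of the filter ends up with different counts, which is what would deny spammers a single target to study.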