Inside the spam filter
Anti-spam solutions rely on a variety of increasingly sophisticated techniques to block spam
Anti-spam solutions use a variety of techniques to check the contents of e-mail, gathering information from all parts of the message including the header, body, and any attachments. A basic technique of spam filtering involves checking the header of the message for the IP address of the original sender and comparing it to a whitelist or blacklist. Blacklists are lists of addresses of known spammers, and whitelists are lists of senders whose e-mail should be allowed through even if it appears to be spam. The filter may also look for signs that the message header has been forged to hide the original sender.
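The list check described above can be sketched in a few lines of Python. This is only an illustration, not any vendor's implementation: the addresses are made-up entries from reserved documentation ranges, and the `check_sender` function and its three verdicts are assumptions of mine.

```python
import ipaddress

# Hypothetical lists, using addresses from reserved documentation ranges.
WHITELIST = {ipaddress.ip_address("192.0.2.10")}
BLACKLIST = {
    ipaddress.ip_address("203.0.113.5"),
    ipaddress.ip_address("192.0.2.10"),  # also whitelisted, to show precedence
}

def check_sender(ip_text):
    """Return 'allow', 'block', or 'scan' for the original sender's IP."""
    ip = ipaddress.ip_address(ip_text)
    if ip in WHITELIST:   # whitelist wins, even if the sender looks like a spammer
        return "allow"
    if ip in BLACKLIST:
        return "block"
    return "scan"         # unknown sender: hand off to content checks

for sender in ("192.0.2.10", "203.0.113.5", "198.51.100.7"):
    print(sender, check_sender(sender))
```

Checking the whitelist first reflects the description above: whitelisted senders get through even when other evidence says the message is spam.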
Content checking is the basis for most anti-spam technology, and includes simple filtering based on certain words, attachment types (such as MP3 files), signatures, heuristics, and statistical analysis. Simple filtering (blocking all messages containing the words “Viagra” or “spam,” for example) has two problems: legitimate messages may contain those words, and spammers can easily disguise them, either by deliberate misspelling (“V!agra”) or by sending an HTML message with invisible characters inserted between the visible ones. The same is true of signatures, where the filter looks for content similar to known spam. The tools spammers use to conceal or obfuscate the content that filters look for continue to become more sophisticated.
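To make the weakness of simple word filtering concrete, here is a rough Python sketch. The banned words come from the example above; the function names, the character substitution table, and the tag-stripping pattern are my own assumptions, and a real product would need far more thorough normalization.

```python
import re

BANNED = {"viagra", "spam"}  # the two words from the example above

def naive_filter(text):
    """Flag a message if any whole word is on the banned list."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(word in BANNED for word in words)

# Hypothetical substitution table for a few common lookalike characters.
LOOKALIKES = str.maketrans({"!": "i", "1": "i", "0": "o", "@": "a"})

def normalized_filter(text):
    """Strip HTML tags and comments, undo lookalikes, then run the naive check."""
    no_tags = re.sub(r"<[^>]*>", "", text)
    return naive_filter(no_tags.lower().translate(LOOKALIKES))

samples = [
    "Buy Viagra now",              # plain spam
    "Buy V!agra now",              # deliberate misspelling
    "Buy Vi<!-- x -->agra now",    # invisible HTML comment inside the word
    "Our spam filter report",      # legitimate message with a banned word
]
for message in samples:
    print(naive_filter(message), normalized_filter(message), message)
```

The naive check misses both disguised messages and flags the legitimate one. Normalizing first catches the disguises but does nothing about the false positive, which is one reason vendors moved on to scoring and statistics.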
As spammers become more adept at bypassing filters, anti-spam vendors must find more sophisticated methods of detecting spam. Heuristic filtering applies a series of rules to score an e-mail, so that a message might get one point for containing the word “Viagra,” one point for a “click here” link, one point for a price ($19.99), one point for a “click here to unsubscribe” link, and one point for a URL that points to a known spam site. A score of three or more might get the message quarantined.
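The scoring scheme just described might look something like the following Python sketch. The five rules and the threshold of three come from the example above; the rule implementations, the `spam.example` host, and the function names are illustrative assumptions, not how any tested product works.

```python
import re

KNOWN_SPAM_HOSTS = {"spam.example"}  # hypothetical list of known spam sites

def links_to_spam_site(text):
    hosts = re.findall(r"https?://([^/\s]+)", text)
    return any(host in KNOWN_SPAM_HOSTS for host in hosts)

# One point per rule that matches; each rule sees the lowercased message.
RULES = [
    ("mentions Viagra", lambda t: "viagra" in t),
    ("click here link", lambda t: "click here" in t),
    ("contains a price", lambda t: re.search(r"\$\d+\.\d{2}", t) is not None),
    ("unsubscribe link", lambda t: "click here to unsubscribe" in t),
    ("known spam URL", links_to_spam_site),
]
THRESHOLD = 3  # three or more points gets the message quarantined

def score(text):
    lowered = text.lower()
    return sum(1 for _name, rule in RULES if rule(lowered))

def is_quarantined(text):
    return score(text) >= THRESHOLD

spammy = "Viagra for $19.99! Click here: http://spam.example/buy"
normal = "Meeting notes attached; lunch costs $12.50"
print(score(spammy), is_quarantined(spammy))
print(score(normal), is_quarantined(normal))
```

Note that an unsubscribe link also matches the plain “click here” rule in this sketch, so it earns two points. Real rule sets typically weight rules differently rather than giving each a single point.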
Bayesian filtering uses statistical analysis to detect spam. It looks at content and assigns a probability that a document is spam based on the number of documents defined as spam (or not spam) with similar content. Thus, a message containing “Viagra” would have a certain probability of being spam, but a message containing “V!agra” would have a much higher probability of being spam, since legitimate e-mail wouldn’t likely include the misspelling. Likewise, a message containing “RFP” (request for proposal) or “process,” for example, would have a very high probability of being legitimate mail.
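A toy version of this statistical approach, written in Python, shows why the misspelling scores higher. It is a simplified naive Bayes sketch under my own assumptions: the six training messages are invented, the class and method names are hypothetical, only words present in the message are considered, and add-one smoothing stands in for the tuning a real product would do.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Distinct lowercase tokens; '!' and '$' are kept so 'v!agra' stays whole."""
    return set(re.findall(r"[a-z!$]+", text.lower()))

class BayesFilter:
    def __init__(self):
        self.docs = {"spam": 0, "ham": 0}
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        self.docs[label] += 1
        self.counts[label].update(tokens(text))  # documents containing each token

    def spam_probability(self, text):
        # Start from the prior odds, then add each token's log-likelihood ratio.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))

bayes = BayesFilter()
for text in ("cheap v!agra click here", "v!agra best price",
             "viagra discount click here"):
    bayes.train(text, "spam")
for text in ("please review the rfp process",
             "viagra patent ruling discussed in rfp", "rfp process update"):
    bayes.train(text, "ham")

for word in ("v!agra", "viagra", "rfp"):
    print(word, round(bayes.spam_probability(word), 2))
```

With this invented training set, “V!agra” appears only in spam, “Viagra” appears in both kinds of mail, and “RFP” appears only in legitimate mail, so the three words rank in that order from most to least spam-like.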
All the products I tested use a combination of these techniques. Some combine heuristics with blacklists, or Bayesian analysis with source IP checking. As my tests show, today’s commercial products are extremely good at identifying spam. But even the best of them ultimately block some legitimate mail, if only because marketing e-mails and newsletters that people want have the same characteristics as those they do not want. All enterprise anti-spam solutions include whitelists, because no system is perfect.