1 - 2 min read
Sep 05, 2006
Ever since Paul Graham published "A Plan for Spam" in August 2002 (prerequisite reading for this article), a lot of people have spent a great deal of time applying statistical methods to automatically classify email messages as spam. Generally, spam identification is a hard problem to solve given that the definition of spam can differ from person to person. Messages erroneously classified as spam, known as "false positives," are pretty much intolerable, which further compounds the problem. Statisitical classifiers show great promise in this area as they are able to automatically adjust to handle personal definitions of spam. The odd false positive shows up from time to time, but these become few and far between as the local statistical model continues to improve.
These classifiers already come in many forms. There are POP3 proxies, IMAP proxies, mail file processors, and even classifiers built directly into mail clients. I use POPFile (a na?ve Bayesian classifier in a POP3 proxy) at home with great success. Some work better than others, but with a little training, they all seem to work pretty well. Unfortunately, they have a common shortcoming: They don't cause the spammers any pain. And we all want to cause spammers pain. None of these classifiers are capable of causing the spammers any pain because the spammer is long gone by the time the classifier has the opportunity to process the message. What we need is a way to use the classifier against the spammer while the spammer is still connected.