Thu, Feb 26, 2004

Training Pair’s Bayesian Filter

My spam has risen to unbearable levels. I filter on both the client side, with the BayesIt plugin for The Bat!, and the server side, with Pair's SpamAssassin installation. I'm fed up with having to download spam, though. I don't want to see it; I don't want to even think about it. So my goal is to make the server-side filter better and, as with any Bayesian filter, it can be trained.

On the client side it's easy: just hit a key and BayesIt analyses the message and improves the spam filter. On the server side, though, there's a disconnect, since I can't hit a button to tell the server what is spam and what isn't. After perusing Pair's support pages I came across a document describing how to train the Bayesian filter. It turns out that you can use a mailbox file to train it.

So I created a spam-collecting email address, spam@blog.iandavis.com. Then I set up a filter in The Bat! that redirects the selected mail to spam@blog.iandavis.com and deletes it from my inbox. The Bat! has a neat redirect facility that resends a message to another address while preserving the original sender info. This means the email ends up in the spam@blog.iandavis.com inbox almost unchanged from how it appeared in mine. Hopefully this will prevent the Bayesian filter from learning that forwarded messages from me are spam! I bound this to a hotkey so I can scan my inbox and redirect spam to the collecting mailbox. The final piece of the puzzle was a cron job that uses the contents of the spam@blog.iandavis.com inbox to train the filter and then empties the mailbox.
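For the curious, here is a rough sketch of what a train-and-empty cron job might look like. It is not the actual script, and Pair's document may describe a different procedure. It assumes that SpamAssassin's `sa-learn` tool is on the path and that the collecting mailbox is a single mbox file. The function names and the `run` parameter are made up for illustration.

```python
import os
import subprocess
from typing import Callable, List, Sequence


def build_train_command(mbox_path: str) -> List[str]:
    """Build an sa-learn call that learns every message in the mbox as spam."""
    return ["sa-learn", "--spam", "--mbox", mbox_path]


def train_and_empty(mbox_path: str,
                    run: Callable[[Sequence[str]], int] = subprocess.call) -> bool:
    """Train on the mailbox, then truncate it.

    Returns True only if training ran and succeeded. `run` is a
    hypothetical hook so the command runner can be swapped out in tests.
    """
    # Nothing to do if the mailbox is missing or already empty.
    if not os.path.exists(mbox_path) or os.path.getsize(mbox_path) == 0:
        return False

    exit_code = run(build_train_command(mbox_path))
    if exit_code != 0:
        # Leave the mail in place so the next run can retry.
        return False

    # Truncate rather than delete, so the file itself stays where it is.
    # Mail arriving between training and truncation would be lost; a
    # real job would want to lock the mailbox first.
    with open(mbox_path, "w"):
        pass
    return True
```

Scheduled from cron and pointed at the collecting mailbox's file (wherever the host happens to store it), something like this would only clear the mailbox after `sa-learn` exits successfully, so a failed run leaves the spam in place for next time.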

So, at the press of a key I can train the server-side spam filter. Additionally, the more I mention spam@blog.iandavis.com on the web, the more the spammers will send me juicy training material. Yum yum.

Permalink: http://blog.iandavis.com/2004/02/training-pairs-bayesian-filter/

Internet Alchemy
