bfilter - a Bayesian spam filter
--------------------------------

Everyone and their dog is writing Bayesian spam filters these days. I'm no
exception. The first version of this was a trivial perl script, but it was
slow and crap, so I rewrote it as an enormous C program. This was also an
excuse to implement a skiplist data type, though inevitably my feeling of
cleverness there was punctured by a computer-scientist friend who told me
that my implementation was lousy and shamed me into redoing it. Such is life
(and that's the last time I trust lecture notes downloaded from the web...).

About the only significant thing which this program does which is not
available in Dog Brand(TM) mail filters is a moderately intelligent attempt
to recognise and decode base64 data in emails. Note that bfilter doesn't
actually parse emails properly (according to the rules of RFC[2]822, 1521 and
so forth) but instead applies some heuristics to find chunks of base64
encoded text which extend over more than a few lines. I haven't attempted to
analyse whether this is worthwhile.

bfilter uses a TDB database to store the list of terms and term occurrences
which is used to estimate the probability that a given email is spam. The
sole thing to be said for this approach is that it was easy to implement. TDB
databases are big and not blindingly fast. But disk space is cheap and
performance is acceptable.

NOTE TO UPGRADERS: version 0.3 uses a different database type -- Andrew
Tridgell's TDB, rather than GDBM -- in an attempt to save disk space and
improve concurrency (TDB doesn't have exclusive-lock-on-write). There is a
perl script, migrate-0.2-to-0.3, which performs this conversion for you. You
will need the TDB_File module from CPAN.


Installation
------------

Edit the Makefile and type make. Fix any errors which are reported. Copy the
resulting bfilter executable to Where You Want It; similarly for the man
page, bfilter.1.


Use
---

Assemble a corpus of spam and of non-spam email.
The latter should include a bunch of email you've sent. Then do

    bfilter isspam < spam-emails
    bfilter isreal < non-spam-emails

-- assuming that the two corpora are in the form of Berkeley mail folders
(with emails separated by a blank line and a line starting `From '). The word
counts are stored in the TDB database ~/.bfildb.

Now you can pass a new email through

    bfilter test

and it will add on X-Spam-Words: and X-Spam-Probability: headers based on the
terms it finds in the email. Any existing such header is replaced. You can
then use procmail or whatever to filter on the results, like this:

    :0 fw
    | bfilter test

    :0:
    * ^X-Spam-Probability: YES
    spam

or whatever.

From time to time you should run bfilter cleandb, which will discard from the
database terms which haven't been used in the past four weeks. This will save
some disk space.

bfilter understands the headers added by SpamAssassin and can filter on them;
I find this more useful than SpamAssassin's built-in scoring.


CVS
---

You can access the public CVS repository for bfilter:

    $ CVSROOT=:pserver:anonymous@sphinx.mythic-beasts.com:/home/chris/vcvs/repos
    $ export CVSROOT
    $ cvs login
    CVS password:                   # password is `anonymous'
    $ cvs co bfilter


License, Disclaimer and Other Legal Stuff
-----------------------------------------

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
details.

bfilter is copyright (c) 2003 Chris Lightfoot

See also: http://ex-parrot.com/~chris/software.html#bfilter

$Id: README,v 1.10 2004/04/18 13:56:02 chris Exp $
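
For the curious, the base64 heuristic mentioned in the introduction can be
sketched roughly as below. This is an illustration in Python, not bfilter's
actual C code: the function name find_base64_chunks, the line-length limits
in B64_LINE and the MIN_RUN threshold of four lines are all assumptions made
up for the example. The idea is simply to look for runs of consecutive lines
drawn from the base64 alphabet and to decode only runs which are long enough
that they are unlikely to be ordinary text.

```python
import base64
import re

# Assumption: a "base64-looking" line is 16 to 76 alphabet characters,
# optionally followed by up to two '=' padding characters.
B64_LINE = re.compile(r"^[A-Za-z0-9+/]{16,76}={0,2}$")

# Assumption: "more than a few lines" is taken to mean at least four.
MIN_RUN = 4


def find_base64_chunks(text, min_run=MIN_RUN):
    """Return the decoded bytes of each sufficiently long base64-looking run."""
    chunks = []
    run = []
    # The trailing empty line acts as a sentinel which flushes a run
    # that reaches the end of the text.
    for line in text.splitlines() + [""]:
        stripped = line.strip()
        if B64_LINE.match(stripped):
            run.append(stripped)
            continue
        if len(run) >= min_run:
            try:
                chunks.append(base64.b64decode("".join(run), validate=True))
            except ValueError:
                # The run looked like base64 but did not decode; skip it.
                pass
        run = []
    return chunks
```

A filter built this way would presumably then extract terms from the decoded
bytes in the same way as from the rest of the message, so that spam hidden
inside an encoded attachment still contributes to the score.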