The CRM114 & Mailfilter HOWTO

-Bill Yerazunis, 2003-09-18 (last update 2004-04-28)

This is the CRM114 Mailfilter HOWTO. It describes how to set up CRM114 and Mailfilter to filter your incoming mail, as of the version CRM114-20040328-BlameStPatrick.

This HOWTO doesn't describe _how_ CRM114 or Mailfilter works. It will just set you up well enough that you can start using CRM114 and Mailfilter to filter your mail. It assumes you are running on a Linux box; getting the system running on *BSD, MacOS, or Windows will require considerably more work than we describe here (and is a subject for future HOWTOs).

Remember, CRM114 and Mailfilter are released under the GPL (the license is enclosed in any of the downloads). There is NO WARRANTY WHATSOEVER for this software to be useful in any way; it's going to tamper with your incoming mail and you can easily imagine the dangers in that.

That said, I hope CRM114 and Mailfilter are useful to you; they've been very useful to me. They have been keeping my mailbox clear of clutter since 2002, and I'm convinced CRM114 has better performance than I-the-human at killing spam without accidentally deleting important mail. I've tested myself, and I-the-human is only about 99.7% or 99.8% accurate at best; CRM114 is considerably more accurate than that - easily two or three times more accurate. (As of December 2003, it was 99.95% accurate (N+1 statistics) on my incoming mail stream to a non-business account.)

-------------------------------------------------------------------

Step 0: Scientes Inamicae (Know Thy Enemy)

These are the major steps in using CRM114 Mailfilter.
The steps are pretty simple:

1) Downloading what you need (it's just one or two .gz files)

2) Setting up the executables (not more than ten commands to type, even if you're building from the fresh source)

3) Setting Up your .css files (not more than 2 files to edit of no more than 5 lines each, plus typing one or two commands)

4) Configuring Mailfilter (editing one file, most likely change is ONE line, and we tell you which one)

5) Engaging Mailfilter (if you are using Procmail, this is cut-and-paste about ten lines, otherwise it's create one file containing one line, and typing up to three commands)

6) Training CRM114 and Mailfilter (whenever you get an error, you send it back to yourself, using your current mail tool. How hard can that be?)

7) Adding Priority Lists, Whitelists, and Blacklists
   CRM114 supports whitelists, blacklists, term rewriting, and some other features. You can use these for "guaranteed delivery" from people you really trust - or really hate.

8) Useful Utilities
   Details on the cssutil, cssdiff, and cssmerge utilities. You don't need to know this, but you may find it useful.

-------------------------------------------------------------------------

Step 1: Downloading.

Get yourself a copy of a CRM114 kit. The kits can always be found by visiting the CRM114 homepage at:

    http://crm114.sourceforge.net

You will need at least the statically-linked binary kit (compiled to run on any i386 or better Linux box); for best performance it is suggested you get the source kit and compile it on the processor you will be running CRM114 on.

If you do not have root privs on the box you will be running CRM114 on, it is suggested you stay with the statically linked binaries (this is because the recommended "TRE" REGEX library requires either root to install, or major workaround mojo).
The kits are named:

    crm114-.i386.tar.gz   (statically linked binaries)
    crm114-.src.tar.gz    (complete source code + tests)
    crm114-.i386.rpm      (statically linked .rpm package)

These kit .gz files are fairly small; usually less than one megabyte (currently around 800 Kbytes) so they will download quickly.

You will need to decide if you will be starting off with a pre-learned set of .css files (.css means CRM114 Sparse Spectra) or if you will be creating your own .css files from your own samples of spam and nonspam.

In general, the pre-learned .css files will give you an initially more accurate filter, but after some use and training the self-created filter files will catch up with the pre-learned files, and then the self-created filter files will become _more_ accurate in the long term.

If you decide that you want to start with the pre-learned .css files, you will also need to download:

    crm114-.css.tar.gz

The .css files are rather large; this download may approach 50 megabytes. (Currently it's 8+ megabytes.)

Download the kits you will need (at least one of .src.tar.gz or .i386.tar.gz or .i386.rpm) and then proceed to "Step 2: Setting Up the Executables".

--------------------------------------------------------------------------

Step 2: Setting Up the Executables

In this step, you will install four binaries into your system. The four binaries are:

    crm      - the CRM114 "compute engine". It's called "crm" because "crm114" is too hard to type.
    cssutil  - the .css file check/verify/edit program
    cssdiff  - the .css file diff program
    cssmerge - the .css file merging program

One important point: do NOT install CRM114 or any of its utilities setuid or sgid to root. If you do, that's just an invitation for someone to utterly hose your system without even trying. We're not talking about an intentional attack; just an inadvertent command or a script gone weird could do it.

-----

There are three ways you can set up these executables.
You can:

    a) install with a .rpm kit
    b) install with a .i386.tar.gz (tarball of statically linked binaries)
    c) install with a .src.tar.gz (tarball of complete source)

Note 1: If you do not have root on the machine you are installing on, you may have some problems during the installation. You may want to consider using the statically linked binaries instead of compiling from sources.

-----

Method A: Installing from .rpm

(note - we don't have a good RPM for the current rev, so this section is not really accurate)

Become root, then type:

    rpm -ivh crm114-.rpm

and it'll all happen automagically.

Now, you can test the install. A quick test is to type:

    crm -v

which should report back the version of CRM114 you have just installed. You can also run a quick "Hello, world!" by typing:

    crm '-{ output /Hello, world! This is CRM114 version :*:_crm_version: .\n/}'

then hit ^D (end-of-file on *nix). You'll get back a response like:

    Hello, world! This is CRM114 version 20040118-BlameEric .

If this works, you can proceed on to the next step - "Step 3: Setting Up Your .CSS Files"

-----

Method B: Installing from .i386.tar.gz

This method takes a few more commands to perform. First, untar the binary release. Type:

    tar -zxvf crm114-.i386.tar.gz

You should now become root. If you do not have root on your machine, you _can_ execute CRM114 programs directly from your home directory, by changing your $PATH appropriately; see your shell man page for how to do this for your particular shell (it varies with the shell, so I can't tell you here how to do it) and skip to the end of this step.

Once you're root, type:

    cd crm114-
    make install_binary_only

This will install the pre-built binaries of CRM114 and the utilities into /usr/bin. This is the default install location for CRM114. If you want them installed in a different place, edit the Makefile and change INSTALL_DIR (near the top of the Makefile) to a different directory.
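For the no-root case mentioned above, here is a minimal sketch for Bourne-style shells (sh, bash). The directory name $HOME/crm114-bin is purely hypothetical, so substitute wherever you actually untarred the binaries; csh-style shells use a different syntax, so check your shell's man page.

```shell
# Hypothetical location of the untarred binaries; change to match your setup.
CRM_DIR="$HOME/crm114-bin"

# Prepend the directory to PATH unless it is already listed.
case ":$PATH:" in
    *":$CRM_DIR:"*) ;;                         # already present, nothing to do
    *) PATH="$CRM_DIR:$PATH"; export PATH ;;
esac

# Show which "crm" the shell would now run, if any.
command -v crm || echo "crm not found in PATH yet"
```

To make the change permanent you would put the same lines in your shell's startup file.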
Note that if you type "make clean" you'll _delete_ your prebuilt binaries, so don't do that!

Now, you can test your work. Type

    crm -v

which will cause CRM114 to print out the version of itself you just installed. You can also run a quick "Hello, world!" by typing:

    crm '-{ output /Hello, world! This is CRM114 version :*:_crm_version: .\n/}'

then hit ^D (end-of-file on *nix). You'll get back a response like:

    Hello, world! This is CRM114 version 20040118-BlameEric .

Congratulations! You've now completed the installation of CRM114 and utilities from prebuilt binaries. Proceed to "Step 3: Setting Up Your .CSS files".

-----

Method C: Compiling from .src.tar.gz (source)

This method is the most complex. Start by uncompressing and untarring the big .src.tar.gz with the command:

    tar -zxvf crm114-.src.tar.gz

Now cd down into the crm114- directory. You will see many files here.

You now have a choice: you can build CRM114 with either the GNU regex libraries (not recommended, as GNU regex can't handle embedded NULL bytes and has other issues), or with the TRE regex library (recommended; this is what you get with the precompiled binary kit).

By default, you will use the TRE regex library; however, this means you have to build and install TRE. You can either grab the most recent version from the TRE homepage at http://laurikari.net/tre, OR you can use the version that is pre-packaged with your CRM114 download. (The pre-packaged version is tested against CRM114 - the fresh one may have new features. Take your choice - it's good stuff either way.)

Fortunately, building and installing TRE is easy. The TRE regex library can peacefully coexist on the same system as the GNU regex library.

To install TRE, become root, then type this (don't forget to tell configure to "--enable-static"):

    cd crm114-
    cd tre-
    ./configure --enable-static
    make
    make install

You have now installed the TRE regex library as /usr/local/lib/libtre .
Depending on your choices in static versus dynamic linking, you _may_ need to also add /usr/local/lib to /etc/ld.so.conf, and then run ldconfig as root. Or not. If, during the next steps, you get annoying messages on the order of "can't find ltre" then this is the thing to try.

Once TRE is built and installed you can compile CRM114 and the accompanying utilities (cssutil, cssdiff, and cssmerge). By default, CRM114 installs into /usr/bin (_not_ /usr/local/bin - if you want to change this, change the definition of INSTALL_DIR near the top of the file "Makefile").

Change directory back up to the CRM114 directory, then become root, then (noting that no ./configure step is necessary) type:

    cd ..
    make clean
    make install

This will compile, link, install, and strip the executables (stripping gets rid of unnecessary debugging information and makes the executables load faster and use less memory).

You can test your installation of CRM114. Just type:

    crm -v

and CRM114 will report back the version of the install. You can also run a quick "Hello, world!" by typing:

    crm '-{ output /Hello, world! This is CRM114 version :*:_crm_version: .\n/}'

then hit ^D (end-of-file on *nix). You'll get back a response like:

    Hello, world! This is CRM114 version 20040118-BlameEric .

Congratulations! You've now completed the installation of CRM114 and utilities from source. Move on to the next step - "Step 3: Setting Up Your .CSS Files".

-----

If you _really_ want to test your installation, you can run it against "megatest.sh", which attempts to test every code path in the system (well, all of the non-error paths at least). Coverage is incomplete, but at least it's a strong confidence indicator.

Note that this only works if you've installed the TRE engine. The GNU regex engine has enough "fascinating behaviors" that it will get a lot of things wrong; the GNU regex package also doesn't handle approximate regexes at all, and since those are in the test set, you'll error out on each of those as well.
The easy way to run megatest is:

    make megatest

which will report back any differences between what your local install of CRM114 did and what the "known correct" results are. If there are any differences between the supplied "megatest.log" and your own results, OTHER than process IDs in the "MINION PROC" results, please file a bug report to me and we'll figure out what went wrong.

------------------------------------------------------------------

Step 3: Setting Up The .CSS files

The .css files (CRM114 Sparse Spectra files) are the "memory" that crm114 uses to statistically describe the words and phrases that characterize various kinds of mail.

You have a choice here. You can either use the pre-learned .css files available from crm114.sourceforge.net , or you can build your own .css files dynamically as spam and nonspam email come in.

In either case your .css files should be in the same directory as your mailfilter will "run" in (yes, this can be changed, but that's an advanced topic).

The particular directory that the mailfilter "runs" in is variable and depends on your local setup. Assuming you will use the ".forward" hook, there are two likely situations.

If your mail service runs on your local machine (say, you have just one machine - and I do hope you have a firewall in that case), then mailfilter will almost certainly "run" in your home directory - the directory you're in when you log in.

If your mail service runs on a mail server (not your local machine), then you will probably have a "home directory" on that machine as well, and that's the directory that the mail filter will run in.

If neither of these is the case, you should ask your system administrator what the correct directory is.

-----

Method A - Build Your Own Empty .CSS Files

This method will give you the best final accuracy, but you will spend more time training. This is the recommended method for users wanting the best accuracy.

To start from scratch, you need to create empty .css files.
The cssutil program will do that for you. Just type:

    cssutil -b -r spam.css
    cssutil -b -r nonspam.css

and you will have created _empty_ spam.css and nonspam.css files in your current directory (that is, the files are full-size, but contain no information. They'll be full of binary zeroes).

Once you have these empty files you will have a high (50% or so) error rate for the first few hours, till you have 'taught' CRM114 what your particular mix of spam and nonspam looks like. Proceed below to "Step 4: Configuring Mailfilter".

Many people want to "preload" their spam collection into CRM114. This is a bad idea. CRM114 is optimized for TOE learning - "Train Only Errors" learning; testing something like a quarter of a million test cases has proven that it is better to train only errors, and _only_ _as_ _they_ _occur_, than to preload a bulk database into CRM114.

The statistics from the "torture test" (about 40,000 messages) are that training _only_ errors, in realtime, will give about 2.1 times better accuracy than force-training a big corpus, even if the messages are the same messages and presented in the same order. The "why" is mathematically complicated, but there's an intuitive description in the FAQ.

Again: you will achieve the best possible accuracy if you let CRM114 itself make errors that you correct in real time.

-----

Method B - Pre-LEARNed files:

This is the simplest method, but less accurate than method A. If you choose to use the pre-learned .css files, you need to download the appropriate crm114 .css.tar.gz file, and then you can just type:

    tar -zxvf crm114-.css.tar.gz

and you'll get the two files "spam.css" and "nonspam.css" in your current directory.

Note that the download is fairly large - between 8 and 50 megabytes, and although this will give you a good starting point for your own statistics, you will have a better (smaller, faster) final configuration if you build your own .css files from scratch.
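Whichever starting point you pick, the train-only-errors rule described under Method A boils down to a very small decision. The sketch below is only an illustration: the function name toe_action is hypothetical, and the --learnspam and --learnnonspam flags it prints are the mailfilter.crm options covered in Step 6.

```shell
# toe_action FILTER_VERDICT YOUR_VERDICT
# Both arguments are "spam" or "nonspam". Prints the mailfilter.crm flag
# to train with, or "none" when the filter got it right.
toe_action() {
    if [ "$1" = "$2" ]; then
        echo "none"          # correct classification: do not train at all
    else
        echo "--learn$2"     # error: teach the filter what the mail really was
    fi
}

toe_action spam spam         # prints: none
toe_action nonspam spam      # prints: --learnspam
toe_action spam nonspam      # prints: --learnnonspam
```

The point of the sketch is the first branch: mail the filter already classified correctly is never fed back in.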
-----

Method C - Build And Preload .CSS Files From Fresh Spam and Nonspam

If you really feel you must start by preloading some sample spam, copy your most recent 100Kbytes or so of your freshest spam and nonspam into two files in the current directory. These files MUST be named "spamtext.txt" and "nonspamtext.txt". They should NOT contain any base64 encodes or "spammus interruptus", straight ASCII text is preferred. If they do contain such encodes, decode them by hand before you execute this procedure.

Remember - if you do it this way, you will NOT achieve the same level of accuracy as if you use method A (training only errors, as they occur) above. The only reason you might ever do it this way is if you need some spam filtering _NOW_ and accept that you are operating with a suboptimal filter. This filter will be worse by about a factor of 2.1 in accuracy and a factor of two worse in speed than one built in the optimal way (that is, method A).

That said, here's how to proceed:

You should use approximately equal amounts of spam and nonspam.

Edit the file "rewrites.mfp" and replace the placeholders (in this case, "wsy", "merl.com", and "mail.merl.com") with your corresponding username, domain name, and mail server information. These rewrite rules will be used to "scrub" your sample text of user-specific strings. (note that this is only strictly necessary if you want to use the pre-built .css files. However, it is in general recommended, so that you can "share/merge" your .css files with your friends.)

Note the "arrowheads" in the file. They look like this:

    >->

This is a rewrite operator. Anything that matches the regex on the left-hand side of the arrowhead will be replaced with the text on the right-hand side of the arrowhead.
Example: if your name was Agent Smith, your email account AgentSmith@the.matrix.org, and your mail router was mail.matrix.org at IP address 192.168.10.5, then the rewrites.mfp file should look like:

    AgentSmith@the.matrix.org>->MyEmailAddress
    [[:space:]]Agent Smith>-> MyEmailName
    mail.matrix.org>->MyLocalMailRouter
    192.168.10.5>->MyLocalMailRouterIP

The idea is to turn your email headers into headers that don't refer to any of your own actual name, address, etc, but contain only the strings "MyEmailAddress", "MyEmailName", "MyLocalMailRouter", and "MyLocalMailRouterIP".

If you have more than one incoming email name, email address, server, router, etc, add lines in rewrites.mfp for each email name, email address, server, router, and so forth. This is something you really _should_ do if you have more than one email path leading to the account that is being filtered by CRM114 (if you don't, a lot of learning will have to be repeated for each path, which will cost you accuracy and use up valuable feature slots in the .css files that you could use in more valuable ways otherwise.)

Finally, type:

    rm -rf spam.css
    rm -rf nonspam.css
    make cssfiles

to build your new spam.css and nonspam.css files.

Again, let me emphasize that doing this kind of "fast build" will lead to a final filter that is _less_ accurate and learns _slower_ than a filter that is only trained on realtime spam/nonspam errors.

-----

CHECKING YOUR .CSS FILES

For all three methods of setting up your .css files, you can check that the .css files are reasonable.
Use the "cssutil" utility:

    cssutil -b -r spam.css
    cssutil -b -r nonspam.css

You should get back a report something like this:

    Sparse spectra file spam.css statistics:

    Total available buckets          :   1048576
    Total buckets in use             :    506987
    Total hashed datums in file      :   1605968
    Average datums per bucket        :      3.17
    Maximum length of overflow chain :        39
    Average length of overflow chain :      1.84
    Average packing density          :      0.48

Note that the packing density is 0.48; this means that this .css file is about half full of features. Once the packing density gets above about 0.9, you will notice that CRM114 will take longer to process text. The penalty is small at packing densities below about 0.95 and only about a factor of 2 at 0.97 .

Note - do NOT believe "ls -la" with respect to .css files! Because CRM114 uses memory mapping instead of file I/O (because it's much faster to go through the page-fault tables than through the file I/O system), the m_time and c_time never change, only the a_time, and that only if your file system had the proper compile-time options to keep track of the a_time. Believe in what cssutil tells you - if new features show up after learning, you _are_ learning and "ls -la" is lying to you!

You can also see how easy it will be for CRM114 to differentiate spam from nonspam with your .css files. The utility "cssdiff" will compare the statistical features of two .css files. Try it:

    cssdiff spam.css nonspam.css

and you'll get back a report like:

    Sparse spectra file spam.css has 1048577 bins total
    Sparse spectra file nonspam.css has 1048577 bins total
    File 1 total features            :   1605968
    File 2 total features            :   1045152
    Similarities between files       :    142039
    Differences between files        :   1279964
    File 1 dominates file 2          :   1463929
    File 2 dominates file 1          :    903113

Note that there's a big difference between the two files; in this case there are about nine times as many differences between the two files as there are similarities. That's pretty much typical.
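If you want to double-check the ratios in the sample reports, the arithmetic is easy to redo with awk. This is just a sketch: the helper name ratio is made up for the illustration, and the reading of each quotient (for instance, that packing density is buckets in use divided by available buckets) is inferred from the sample numbers rather than taken from the cssutil source.

```shell
# ratio NUMERATOR DENOMINATOR: print the quotient to two decimal places.
# LC_ALL=C keeps the decimal point a "." regardless of locale.
ratio() {
    LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}

ratio 506987 1048576     # buckets in use / available buckets   -> 0.48
ratio 1605968 506987     # hashed datums / buckets in use       -> 3.17
ratio 1279964 142039     # cssdiff differences / similarities   -> 9.01
```

The first two results agree with the "Average packing density" and "Average datums per bucket" lines of the cssutil report; the third is the ratio of differences to similarities in the cssdiff report.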
Now, move on to "Step 4: Configuring Mailfilter".

------------------------------------------------------------------------

Step 4: Configuring Mailfilter

In this step you will tell Mailfilter what you want it to do with your mail. All of the options are controlled by editing one file, named "mailfilter.cf" .

By default, Mailfilter looks for mailfilter.cf in the initial directory. If you use "--fileprefix=/some/where/else/" on the command line, mailfilter.crm will look for mailfilter.cf (and the other runtime filtering files!) in the "/some/where/else/" directory. This --fileprefix mode is handy when you are setting up many users.

The format of mailfilter.cf itself is pretty simple.

    0) blank lines are OK.
    1) comments start with a # in column 1.
    2) Anything not a comment is a var setting, in the format:

           :var_to_set: /Value_to_set_goes_here/

All of the user-settable configuration vars have setup lines in mailfilter.cf.

First, you MUST change the secret password. This is defined near the top of the file. Your password may contain a-z, A-Z, 0-9, but no blanks or punctuation (at least for now). You _must_ set this password to something not easily guessable. If you don't set it, you won't be able to use mailfilter's remote commanding facility.

At first, you will probably want to leave the "log_to_allmail.txt" enabled while you get used to CRM114. Likewise, leave "log_rejections" set to yes as well; that way you can easily see (with "less" or "tail") just what is being rejected. Once you get more experience with CRM114, you can set these to "no" and not use up disk space in these "extra safety" logs.

You can skim-read the rest of mailfilter.cf . There are three typical cases for most users:

1) If you are using Procmail:

   --> You probably will NOT need to change any of the other options.
2) If you are NOT using Procmail, and your mail reading program can sort out email into folders based on whether the SUBJECT header contains the telltale string "ADV:" (most mail readers can do this):

   --> You probably will NOT need to change any of the other options.

3) You are NOT using Procmail, and your mail reading program is "dumb" (cannot sort email into folders based on subject line):

   --> You probably will want to define a separate account that will receive all spam caught (otherwise, you'll just get all your spam delivered as usual, with additional headers telling you it was spam). To do this, look down to ":general_fails_to:". Insert the full username@domainname.tld mail address where you want your spam to be sent.

You can also configure the verboseness (or not) of your filtered results. You can go from "no changes" (not even a statistical label in the headers) to complete results including an expansion of any base64 texts and HTML decommented strings.

Feel free to change things to get the look and feel you want; after all, what good is open source if you don't change it? :)

HOWEVER, please don't muck with variables that aren't in the mailfilter.cf file. "You make a mess, you clean it up." :-)

After making these changes, write out "mailfilter.cf". You may later go back and change the configuration options, but the options as already set are good for most users. You do not need to do anything to "load in" the new options, as CRM114 reads them in fresh from the file during initialization for each email.

Now, edit the file "rewrites.mfp". Make the changes to insert your name, your domain, your local mail router, and your local mail router's IP address as specified by the placeholders. (Again, strictly speaking this is not absolutely necessary, but it's good hygiene and will allow you to swap and merge .css files with your friends.)

If you have more than one possible mailserver, mail router, domain, etc. you can add extra lines to rewrites.mfp as desired.
This is very handy for systems that have more than one IP address accepting mail.

-----

Once you have set up mailfilter.cf and rewrites.mfp, you can test your configuration by typing the following (The '^D' at the end is a control-D, which is an END-OF-FILE on Linux. Other systems may use a different END-OF-FILE character):

    ./mailfilter.crm
    This is a test. Just type a few lines of text that you
    might ordinarily get, like a short rant on why Perl is
    useless for big projects, or why Linux is superior or
    inferior to NetBSD.
    ^D

If you have set up Mailfilter for Procmail-style filtering you will always get a small report back saying something like either of these (the actual numbers will change, but you should have something that _vaguely_ looks like the following):

    From foo@bar Thu Sep 18 19:20:35 2003
    X-CRM114-Status: Good ( pR: 12.630237 )

    ** ACCEPT: CRM114 PASS SBPH/BCR TEST**
    Probabilistic match quality: 1.000000, pR: 12.630237
    P(succ): 1.000000e-00, P(fail): 2.342950e-13
    Features: 336, S hits : 4313, F hits : 5901

or:

    From foo@bar Thu Sep 18 19:19:39 2003
    X-CRM114-Status: SPAM ( pR: -2.866484 )

    ** REJECT: CRM114 FAIL SBPH/BCR TEST**
    Probabilistic match quality: 0.001358, pR: -2.866484
    P(succ): 1.358082e-03, P(fail): 9.986419e-01
    Features: 144, S hits : 2337, F hits : 3313

If you are using "mail to spamtrap account" filtering, then you will either get an "accept" report back (the first report above is an "accept") or the text you typed in will be mailed to your spamtrap address. If you don't get a report back, check the spamtrap address and see if your test text ended up there.

If you don't get _either_ of the above, something is broken, either in your installation of CRM114 or in your configuration file. You need to fix the problem before you engage Mailfilter.

If your installation and configuration passes the above test, congratulations! You have now configured mailfilter.crm . Onward, to "Step 5: Engaging Mailfilter".
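If you would rather check the test result from a script than by eye, the verdict can be pulled out of the X-CRM114-Status header. The following is a small sketch that assumes the header looks like the two sample reports above; the function name crm_verdict is hypothetical, and in real use you would feed it the output of ./mailfilter.crm instead of the canned text used here.

```shell
# crm_verdict: read a filtered message on stdin and print the word that
# follows "X-CRM114-Status:" in the first such header line.
crm_verdict() {
    awk '/^X-CRM114-Status:/ { print $2; exit }'
}

# Canned sample copied from the REJECT report; prints: SPAM
printf '%s\n' 'From foo@bar Thu Sep 18 19:19:39 2003' \
              'X-CRM114-Status: SPAM ( pR: -2.866484 )' | crm_verdict
```

If the function prints nothing at all, no status header was found, which matches the "something is broken" case described above (or the spamtrap case, where the text was mailed away instead).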
----------------------------------------------------------------------------

Step 5: Engaging Mailfilter

There are two common ways to engage Mailfilter.crm on your incoming mail stream: you can use Procmail recipes and have Mailfilter run as a procmail subprocess, or you can use the .forward hook of Sendmail (and Sendmail clones which also support .forward).

In the first method (recommended), you use Procmail's ability to execute a program as part of a Procmail recipe to run CRM114, which adds headers as needed to let Procmail or your mail-reading program do the sorting.

In the .forward method, you (or your system manager) must add a link from an execution command of crm114 to the directory /etc/smrsh. This is because sendmail will NOT run any program that isn't "approved" by the system manager (by linking it into /etc/smrsh/whatever). The output of mailfilter is then directly appended to your /var/spool/mail file (or possibly forwarded to your spam-bucket account).

-----

Method A: For Procmail Users

For Procmail users just add a procmail recipe to .procmailrc to run CRM114 and mailfilter whenever your other procmail rules fail to decide what to do.

Here's a sample Procmail recipe set. Notice that we actually have TWO recipes - one to actually run crm114 and mailfilter, the other to then sort the mail based on the result.

    #
    #
    :0fw: .msgid.lock
    | /usr/bin/crm -u /home/my_user_directory mailfilter.crm

    :0:
    * ^X-CRM114-Status: SPAM.*
    mail/crm-spam

That's all that Procmail users should need. Mailfilter should now be active - send yourself a test message and see where it ends up. If you get the test message, proceed to "Step 6: Training CRM114".
( note: Sub-Method A-one) If you use an MUA that can highlight on headers, you can use something like this in your procmail (from Philipp Weiss):

in .procmailrc

    CRMSCORE=`$HOME/bin/crmstats.sh`

    :0fw: .formail.crm114.lock
    | formail -I "X-CRM114-Score: $CRMSCORE"

where ~/bin/crmstats.sh is a simple script:

    #!/bin/bash
    grep -a -v "^X-CRM114" | \
    /usr/bin/crm -u $HOME/.crm114 mailfilter.crm --stats_only

Advanced Topic: Huge Emails and Denial Of Service Avoidance

CRM114 has a built-in anti-Denial-of-Service (anti-DoS) feature in that it will not grow buffers beyond a certain limit. However, you may find that you actually receive emails bigger than this limit. In these cases, it is effective to simply filter on the first few tens of kilobytes of incoming text. This is easy to do with "head".

    head -c 10000

gives the first 10,000 characters of input, which is usually adequate for CRM114 to get a good decision on. This can be directly piped in right in the procmail command:

    :0fw: .msgid.lock
    | head -c 10000 | /usr/bin/crm -u /home/my_user_directory mailfilter.crm

    :0:
    * ^X-CRM114-Status: SPAM.*
    mail/crm-spam

-----

Method B: The .forward hook file

For .forward hook users: be aware that you should NOT put a direct link to crm in /etc/smrsh; since crm can do arbitrary things, you ought to attempt to control the damage as much as possible.

1) add a link from /etc/smrsh to crm114's executable binary in /usr/bin by becoming root and typing:

    cat > /etc/smrsh/crmfilter
    /usr/bin/crm mailfilter.crm >> /var/spool/mail/your_account_name_here
    ^D

2) add a .forward file to your account by typing:

    cat > .forward
    |/etc/smrsh/crmfilter
    ^D

That's all. The mailfilter should now be active - send yourself a test message and see where it ends up.

----

Once you have engaged CRM114 mailfilter, you now get to train it to recognize spam and nonspam. Proceed to "Step 6: Training CRM114".

Note: CRM114 contains a design decision that you may have to play with.
Instead of doing memory management games, which both consume significant runtime CPU and present a major denial-of-service opportunity, CRM114 has an upper limit on the window size and it simply won't exceed that limit (it gives an error message if an incoming message tries to exceed the limit).

You -can- change the maximum memory limit at runtime with the -w nnnnn flag; for example, if you want 100 megabytes of memory available, you can set that with

    ... -w 100000000

to set 100,000,000 bytes as the hard limit ceiling on memory usage.

---------------------------------------------------------------------------

Step 6: Training CRM114 and Mailfilter

One of the great strengths of CRM114 Mailfilter is that it has no preconceived notions of "spam" and "nonspam". It _learns_ what you consider spam, and what you consider nonspam.

For the first few days CRM114 will make a lot of mistakes sorting spam and nonspam. It is _very_ important that you train each mistake back into CRM114, otherwise it will never learn what you consider spam or nonspam.

You should train in the mistake as quickly as possible. Start one morning and try to train every hour for the first few hours at least. Don't think you're training a computer - pretend you're housebreaking a new puppy.

You train mistakes right from your mail reader. There are two ways to do this...

The first way is to use the built-in command feature. Just forward the mistake back to yourself, with full headers (except edit out any CRM114-added headers or text). Just before the first line of the text to be "learned" as spam or nonspam insert a COMMAND line. Everything from the command line to the end of the message will be learned (so edit the text to remove things you _don't_ want considered indicative of spam/nonspam nature).

The command line looks like this:

    command spam

or

    command nonspam

The "c" in "command" must be in column 1, and you must put your secret password into the command line.
Examples: If your secret password was "Ihatespam", then the command line to learn something as spam would be:

    command Ihatespam spam

and the command to learn something as nonspam would be:

    command Ihatespam nonspam

The second way to train in spam and nonspam is to use mailfilter.crm's command line options. When you find a spam that was mistakenly accepted as good mail, pipe it through mailfilter.crm with the "--learnspam" flag set, like this:

    bash> mailfilter.crm --learnspam < the_spam.txt

Likewise, if you get an email that was falsely classified as a spam, pipe it through mailfilter with the "--learnnonspam" flag set, like this:

    bash> mailfilter.crm --learnnonspam < the_NON_spam.txt

(yes, if you have a scriptable mail reader, you can put these functions right on the menu bars somewhere. Yes, that's a hint. :) )

For both ways: try to train _approximately_ equal amounts of spam and nonspam. If you are within 50% one way or the other, performance will be very good.

Train only errors! This is called TOE training. (TOE :== Train Only Errors) It's not necessary to train near-misses; experiments show that the performance increase on training near misses is minuscule at best, and may be negative at times.

For at least the first day or so, it's best to check your mail at least every hour or so and send training information back to CRM114. This will help it rapidly converge on a good set of statistics for your particular mix of spam and nonspam.

It will take several days worth of errors for CRM114's mailfilter to approach 95% accuracy, and around two weeks to a month to reach 99+ per cent accuracy. I usually exceed 99.9% accuracy (less than one error per thousand).

What To Do if CRM114 says "LEARNING UNNECESSARY..."
---------------------------------------------------

Occasionally, some CRM114 configurations may refuse to learn an error, claiming that it "got it right the first time". (Yes, this is a subtle bug that is not allowing itself to be found, but there is reason to believe it has to do with the interaction of mail clients and headers, and that some mail readers don't give you the headers, the full headers, and nothing but the headers.)

While we applaud this self-confidence, the error is still there, so you need to "force" the learning. You can do this either from BASH or from the mail-to-yourself command line.

From BASH, add "--force" to the command line:

    # mailfilter.crm < the_error_text --learnspam --force

For mail-to-yourself commands, add "force" to the command line:

    command mysecretpassword spam force

(and similarly for nonspam).

-----------------------------------------------------------------------

Step 7: Adding Priority Lists, Whitelists, and Blacklists

If you really want, you can add white, black, and priority lists to CRM114. Most people don't need them, but there are always exceptions. For example, your lawyer, your boss, and your paramour all probably rate being on your "whitelist", so whatever they send to you is always marked "nonspam". Likewise, your ex-girlfriend/boyfriend, your nagging acquaintance, and the stalker from the library should all get blacklisted.

Whitelisting, blacklisting, and prio-listing are all based on regex matching. If the regex you put in the file "whitelist.mfp" matches the incoming mail _anywhere_, the mail will be marked "good" no matter how it scores statistically. Similarly, if the mail matches any regex in "blacklist.mfp", the mail will be marked as "spam", no matter how it compares statistically.
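As a rough illustration of the "matches _anywhere_" rule above, here is a minimal Python sketch. It uses Python's re module rather than CRM114's own regex engine, and the function name, sample patterns, and sample mail are all made up for the example; it only shows the search-anywhere idea, not how mailfilter.crm is actually written.

```python
import re

def first_match(patterns, mail_text):
    """Return the first pattern found anywhere in mail_text, or None.

    Hypothetical helper: mimics "regex matches the mail anywhere",
    using Python's re.search instead of CRM114's regex engine.
    """
    for pattern in patterns:
        if re.search(pattern, mail_text):
            return pattern
    return None

# Made-up stand-ins for lines of whitelist.mfp and for an incoming mail.
whitelist = [r"boss@example\.com", r"lawyer@example\.org"]
mail = "From: someone@example.net\nSubject: hi\n\nPlease cc boss@example.com on this."

# The pattern appears only in the body, yet the mail still matches.
print(first_match(whitelist, mail))
```

Because the search is not tied to any particular header, a pattern in the list matches wherever it appears in the mail, including the message body.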
Note that matching anywhere in the mail can sometimes cause considerable confusion; for example, "ac.com" in a whitelist will match not just "billing.ac.com" but also "drac.complete.viagra.sales.com" (the match being the 'ac.com' in "drac.complete"). To prevent this, use ^ and $ to "anchor" the start and end of the regex, if possible.

Lastly (well, actually firstly, because prio-listing happens before whitelisting or blacklisting), any mail that matches a regex in priolist.mfp is whitelisted or blacklisted as that entry directs. The format of priolist.mfp is that the first character on the line is a + or a -, which indicates "whitelist" or "blacklist", and the rest of the line is a regex. These regexes are tested in the order given in the file. An empty file is perfectly acceptable.

For examples of how to set up the whitelist, blacklist, and priolist files, see the included "whitelist.mfp.example", "blacklist.mfp.example", and "priolist.mfp.example".

Note: for my accuracy tests, I *turn off* whitelists, blacklists, and prio-lists. Be sure to test any whitelist, blacklist, or other list that you add; otherwise you may get a rude surprise some day.

----------------------------------------------------------------

Step 8: Useful Utilities

You don't _need_ to know the stuff in this section to set up and use CRM114 and mailfilter, but it might be useful to you, or at least satisfy some of your curiosity.

There are three utilities for dealing with the .css files (these are the files that contain the "learned information"). The utilities are:

    cssutil  - gives you a readout of the characteristics of the information in a .css file

    cssdiff  - gives you a summary of the differences between two .css files (handy for seeing learning!)

    cssmerge - merges two .css files into one; handy for importing new data into a .css file. Note that this is a destructive operation on the first .css file named!

The cssutil utility:

Usage is

    cssutil somefile.css

which will give you statistics on the file somefile.css.
You can then rescale, clip, and otherwise manage your .css files. It is especially useful to check the "Average Packing Density" of the .css files you use; when it approaches .7 to .8, you may want to consider enlarging your .css file. To do that, see "Enlarging a .css file" below.

Here's the -h help:

    Usage: cssutil [-b -r] [-s css-size] cssfile
      -h           - print this help
      -b           - brief; print only summary
      -r           - report then exit (no menu)
      -s css-size  - if no cssfile found, create new cssfile with this many buckets.
      -S css-size  - same as -s, but round up to next 2^n + 1 boundary.

The cssdiff utility
-------------------

To get the difference between two .css files, use

    ./cssdiff somefile.css anotherfile.css

which writes out a summary of how different the two .css files are.

The cssmerge utility
--------------------

To merge two .css files, use cssmerge:

    ./cssmerge outfile.css infile.css

Note that this is _destructive_ to outfile.css, so make a copy somewhere else first.

You _CAN_ merge two .css files of different length. You can also expand (or contract) a .css file this way: rename the old file, and allow a new one to be created with learnspam or learnnonspam while using the '-s nnnnnnnnn' s(lots) flag to set the number of feature slots desired in the new file. Then cssmerge your old file into the fresh new file, and all is well.

Here's the cssmerge help:

    Usage: cssmerge <out-cssfile> <in-cssfile> [-v] [-s]
      <out-cssfile> will be created if it doesn't exist.
      <in-cssfile> must already exist.
      -v           -verbose reporting
      -s NNNN      -new file length, if needed

Enlarging a .css file
---------------------

One of the advantages of CRM114 is that the .css files are relatively small and of fixed size; they don't grow out of control and never need trimming if you use <microgroom>, which is the default in mailfilter.crm. The disadvantage of this is that if your spam/nonspam discrimination is too convoluted, it won't be able to sort them out (in trek-speak this is a high-order nonlinearity in the discrimination function).
The fix in this situation is to increase the dimensionality of the feature space. The number of dimensions is about 1/12 the number of bytes in the .css files; this works well at about a million dimensions (12 megabytes) for most people. But if you're not most people, you may need to (eventually) increase it.

You can tell when this is necessary: running cssutil will give you a utilization and percentage of slots full; when that gets up near 95 percent, you may be running low on space and old features will be erased to make room for new features (that is, your feature set will dynamically evolve in real time to find what works). However, that's slow and may cause a slight loss of accuracy.

One way to fix this is to "increase the dimensionality of the discrimination hyperspace" (no, I am not making that phrase up). It means to add new slots to the .css files. The easiest way to do this is to

    1) use cssutil to create a temporary, empty, larger .css file

    2) merge the data from the old, small .css file onto the new big file.

    3) copy the new big file over the old, small file.

You can even combine steps 1 and 2, because newer versions of cssmerge will create a new file if needed (the -s N flag sets the number of slots in the new file; -S N does the same thing but rounds up to a 2^N+1 boundary, which is recommended).

For example, here's how to increase the size of the spam.css file from 1,000,001 slots (the default) to 2,000,001 slots. Just type:

    cssmerge temporary.css spam.css -s 2000001
    mv temporary.css spam.css

The newly replaced spam.css will have all of the features of the old spam.css file, but will be 2000001 slots long instead of the default 1000001 slots.

---------------------------------------------------------------------

That's all! If you have errors or updates (or find bugs!) please let me know; the best way is to join the CRM114-general mailing list; it's on the webpage:

    http://crm114.sourceforge.net

and ask there.
The reason for using the mailing list rather than personal email is that personal email isn't archived, but the mailing list _is_ both archived and read widely, so we not only create a background archive of solutions but you will get a better answer back faster than if you sent the email to me alone.

Enjoy, and good luck.

    -Bill Yerazunis