# $Id: NEWS,v 1.43 2004/08/19 11:34:24 vanbaal Exp $

CRM114 NEWS - user visible changes (and some other changes also.)

Refer to ChangeLog for detailed per-file info. Refer to README for news
about changes in the non-autoconfiscated CRM114 core (some of the contents
here are copied from that file).

Version 20040816.BlameClockworkOrange-auto.3, 2004-08-19 (JvB)

 * Makefile.am: make sure TODO gets installed
 * ChangeLog: generated by cvs2cl(1)
 * man/cssdiff.azm, man/cssmerge.azm, man/cssutil.azm: manpages updated.
   Thanks Bill Yerazunis and Shalendra Chhabra.
 * man/crm.azm: new short intro manpage added. crm114(1) is still
   available too: crm(1) and crm114(1) are _different_ pages now.
 * src/crm_main.c: applied patch from Johan Petersson. Fixes a bug which
   caused crm114 to exit 0 no matter what argument you use for the exit
   statement. See Message-Id:
   <6.1.1.1.2.20040818184005.054339b8@mail.trilithium.net> on
   crm114-general@lists.sourceforge.net.
 * examples/whitelist.mfp.example: ship this file, which is used in some
   tests, too. Thanks Paolo.

Version 20040816.BlameClockworkOrange-auto.2, 2004-08-18 (JvB)

 * docs/Makefile.am: make sure QUICKREF.txt gets distributed and installed
   (Thanks Ondřej Surý).
 * configure.ac, tests/Makefile.am: ship 5 more tests from Bill (Thanks
   Ondřej Surý).

Version 20040816.BlameClockworkOrange-auto.1, 2004-08-18 (JvB)

 * Build from crm114-20040816.BlameClockworkOrange.src (there were no
   autoconfiscated versions for 20040723.BlameNashville,
   20040808.BlamePekingAcrobats, 20040815.BlameClockworkOrange.)
 * src/Makefile.am: tools now depend on crm_css_maintenance.c.

Version 20040627-BlameSeifkes-auto.crm.20040629

 * Build from new release from Bill crm114-20040627-BlameSeifkes.src.tar.gz
   (there were no autoconfiscated versions for
   crm114-20040513-BlameSiefkes.src.tar.gz,
   crm114-20040601-BlameKyoto.src.tar.gz,
   crm114-20040606-BlameTamar.src.tar.gz,
   crm114-20040617-BlameSeifkes.src.tar.gz,
   crm114-20040625-BlameSeifkes.src.tar.gz)

Version 20040419-BlameEasterBunny-auto.1, 2004-04-20

 * updated for BlameEasterBunny (minor files added, removed) (PEP)

Version 20040409-BlameMarys-auto.1, 2004-04-12

 * updates for BlameMarys (a couple new expr*.c files)
 * added tests/testscript.sh to run all tests

Version 20040328-BlameStPatrick-auto.1, 2004-04-08

 * major surgery on sledgehammer and prebootstrap
 * cookfunc.sh (and splitfunc.awk) no longer needed
 * patching sysincludes seems to not be needed
 * first autoconfiscated release of Blame St. P.
 * (surgery by Peter Popovich, some other changes by Raul Miller)

Version 20040221-BlameYokohama-auto.2, 2004-02-25

 * Fine-tuning of crm114 manpage. Thanks Seth Hanford. (JvB)
 * No longer ship QUICKREF.txt since it is in crm114(1).

Version 20040221-BlameYokohama-auto.1, 2004-02-25

 * Build from new public release
   crm114-20040221-BlameYokohama.src.tar.gz. (JvB)

Version 20040212-BlameJetlag, some time around 2004-02-15
(notes from Raul Miller, about autoconfiscated version)

 * Build from BlameJetlag

(other notes)

 * README.1st updated. Link to list archive added.

version 20040207-BlamePaoloMore-auto.1, 2004-02-07
(notes from Joost van Baal, about autoconfiscated version)

 * Build from new hidden release
   crm114-20040207-BlamePaoloMore-auto.1.src.tar.gz.
 * Make sure manpages are retypeset when shipping new tarball:
   MAINTAINERCLEANFILES in man/Makefile.am added.
 * Added Bill's original NEWS, as shipped in README file, to this file.
   Trimmed README accordingly.

version 20040206-BlamePaolo-auto.1, 2004-02-07
(notes from Joost van Baal, about autoconfiscated version)

 * Build from new hidden release crm114-20040206-BlamePaolo.src.tar.gz.
 * misc/crm114.spec and misc/rename-gz are gone: RPM will be maintained,
   built and published by Peter Popovich.

version February 2, 2004

This is a release of CRM114 and Mailfilter. The last few known bugs have
been stomped (including a moderately good infinite loop detector for string
rewrites, and a "you-didn't-set-your-password" safety check), the
classifier algorithms have been tuned (default is full Markovian), and it's
been moderately well tested. Accuracies over 99.95% are documented on
real-time mail streams, and the overall speed is 3 to 4x faster than
SpamAssassin.

My thanks to all of you whose contributions of brain-cycles made this code
as good as it is.

   -Bill Yerazunis

version 20040118-BlameEric-auto.1, 2004-01-22
(notes from Joost van Baal, about autoconfiscated version)

 * Build from new hidden release crm114-20040118-BlameEric.src.tar.gz:
   New in this release from Bill:

   This one needed a major change to the way configuration worked in order
   to let the CAMRAM people do what they need. There's nothing really scary
   in it, so it's a "virtual sanity check" version. However, it DOES
   introduce some new code and I'm not comfortable with putting it out
   there till it's better tested.

   There is ONE MAJOR CHANGE. BE WARNED:

      mailfilterconfig.crm IS GONE!
      mailfilter.cf IS NEW AND HERE!

   (this was forced in order to support --userdir)

   WHAT's NEW IN THE NEW RELEASE:

   0) --userdir is now supported. Usage is

        mailfilter.crm --userdir=/home/my/user/dir/
                                                  ^
                                                  |
                                   DON'T FORGET THE CLOSING SLASH!

      This is a genuine prefix and if you leave off the slash you will get
      _nothing_. This doesn't do a "cd wherever" like -u does; instead this
      changes the prefix of all files referenced by mailfilter.crm (and it
      does it the hard way, in the script. Not with a "cd")

   1) mailfilterconfig.crm is No More.
      It's totally deprecated.

   1a) the new configuration file is "mailfilter.cf"

   2) likewise, camrammer.crm is totally deprecated. Use --stats_only in
      mailfilter instead.

   3) mailfilter.cf is NOT an insert file. Instead, there's a little
      mini-parser that runs in mailfilter.crm that reads mailfilter.cf and
      sets up variables. The syntax is extremely obvious when you read the
      "mailfilter.cf" file. :)

   4) learnspam.crm and learnnonspam.crm are likewise out of date and
      totally deprecated. Use mailfilter --learnspam and --learnnonspam
      respectively instead.

   5) some file permissions ought to be better than they were before. It
      seems right to me, but let me know.

      -Bill Yerazunis

 * Changed prebootstrap to add a leading ``#include config.h'' to
   src/crmregex_tre.c. This fixes a bug in
   20040111a-v1.0-SanityCheck-auto.1: that release failed to build on
   systems with libtre installed.

version 20040118
(Notes from Bill Yerazunis, on non-autoconfiscated version)

It turns out that CAMRAM needs (as in, it is a virtual showstopper) the
ability to specify which user directory all of the files are to be found
in. Since #insert _cannot_ do this (it's compile time, not run time),
mailfilter.crm (and classifymail.crm) now have a new
--fileprefix=/somewhere/ option.

To use it, put all of the files (the .css's, the .mfp's etc) that are on a
per-user basis in one directory, then specify

     mailfilter.crm --fileprefix=/where/the/files/are/

Note that this is a true prefix: you must put a trailing slash on to
specify a directory by that name. On the other hand, you can specify a
particular prefix on a per-user basis, e.g.:

     mailfilter.crm --fileprefix=/var/spool/mail/crm.conf/joe-

so that user "joe" will use mailfilter.crm with these files:

     /var/spool/mail/crm.conf/joe-mailfilter.cf
     /var/spool/mail/crm.conf/joe-rewrites.mfp
     /var/spool/mail/crm.conf/joe-spam.css
     /var/spool/mail/crm.conf/joe-nonspam.css

and so on.
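
To make the true-prefix behaviour concrete, here is a minimal Python sketch
of what plain concatenation of the prefix and a file name produces. It is
not part of CRM114: the function name `prefixed_paths` and the constant
`PER_USER_FILES` are illustrative assumptions, and the file names are the
ones listed in the example above. It is only meant to show why a directory
prefix needs its trailing slash.

```python
def prefixed_paths(fileprefix, names):
    """Join a true file prefix to each per-user file name.

    This is plain string concatenation, not os.path.join: the prefix is
    used verbatim, so a directory prefix must carry its own trailing slash.
    """
    return [fileprefix + name for name in names]


# Hypothetical list of per-user files, copied from the example above.
PER_USER_FILES = ["mailfilter.cf", "rewrites.mfp", "spam.css", "nonspam.css"]

if __name__ == "__main__":
    # Per-user prefix: every file name gets "joe-" glued onto the front.
    for path in prefixed_paths("/var/spool/mail/crm.conf/joe-", PER_USER_FILES):
        print(path)

    # Directory prefix WITHOUT the trailing slash: the last directory name
    # and the file name run together, which is almost never what you want.
    print(prefixed_paths("/where/the/files/are", ["spam.css"])[0])
```

Because the prefix is not treated as a path component, the same mechanism
covers both the "one directory per user" and the "one name prefix per user"
layouts.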

Note that this does NOT override --spamcss and --nonspamcss options;
rather, the actual .css filenames are the concatenation of the fileprefix
and spamcss (or nonspamcss) names.

version 20040111a-v1.0-SanityCheck-auto.1, 2004-01-14
(notes from Joost van Baal, about autoconfiscated version)

 * Build from new hidden release
   crm114-20040111a-v1.0-SanityCheck.src.tar.gz (missed
   crm114-20040107-1.0-SanityCheck.src.tar.gz)
 * We now support both GNU regexp library as shipped with libc, as well as
   Ville Laurikari's libtre. On systems lacking a libtre, we fall back to
   using the GNU regexp library. Since GNU regex needs ``#include '', we
   patched src/crm114_sysincludes.h, to take ./configure's output into
   account. We now ship Bill's crmregex_gnu.c source file, in order to
   facilitate this.
 * README.1st updated: note on status of this side project, not on
   relevance of original build instructions.
 * megatest.log is renamed to megatest_knowngood.log by Bill

Version 20040105 (recheck)
(notes from Bill)

The only fixes here are to make the Makefile a little more bulletproof and
to let you know how to fix a messed-up /etc/ld.so.conf, and of course this
document has been updated.

Otherwise this version should be the same as the December 27 2003
(SanityCheck) version, which has no reported reproducible bugs higher than
a P4 (documentation and feature request).

For the last two weeks, I had _one_ outright error and two that I myself
found borderline out of about 5000 messages. That's 2x better than a human
at the same task.

My thanks to all of you whose contributions of brain-cycles made this code
as good as it is.

   -Bill Yerazunis

version 20040102-1.0-SanityCheck-auto.1, 2004-01-06

 * Build from new official release
   crm114-20040102-1.0-SanityCheck.src.tar.gz

version 20031229-1.0-SanityCheck-auto.2, 2004-01-01
(notes from Joost van Baal, about autoconfiscated version)

 * cssdiff(1), cssmerge(1), cssutil(1) manpages added.
 * HACKING gets distributed and installed now.
 * All .crm scripts have a flexible #!-line: path/to/crm gets adapted,
   according to ./configure's --prefix.
 * A symlink bindir/crm to crm114 gets created.

version 20031229-1.0-SanityCheck-auto.1, 2003-12-30
(notes from Joost van Baal, about autoconfiscated version)

 * Build from new official release
   crm114-20031229-1.0-SanityCheck.src.tar.gz
 * New C sourcefile splitting: changed the prefix on the split files from
   "crm114_" to just "crm_". The only ones that stay as "crm114_" are the
   .h files.
 * Install libexec/crm114/* and doc/crm114/examples/tests/* as executables.
 * Install pad.crm, pad.dat, shroud.crm (these got lost in the file
   reshuffling)
 * Start building support for ./configure-time expanding of shebangs in
   crm scripts.
 * Added explicit run-time check to configure.ac, in order to catch hosts
   with a half-setup libtre early.

Version 20031227 (SanityCheck)
(notes from Bill)

This is (hopefully) the last test version before V1.0, and bug fixes are
minimal. This is really a sanity check release for V1.0. It is now time to
triage what needs to be fixed versus what doesn't, and very few things NEED
to be fixed.

Things that changed (or not) are:

1) BUGS ACTUALLY FIXED:

   removed the arglist feature from mailfilter.crm; there's a poorly
   understood bug in NetBSD versus Linux that breaks things.

   allmail.txt flag control wasn't being done correctly. That's fixed.

   a couple of misleading comments in the code are fixed.

2) THINGS THAT ARE NOT CHANGED IN THIS VERSION BUT ARE V1.1 CANDIDATES:

   the install location fix is NOT in V1.0. This will move the location of
   the actual binary (/usr/bin/crm versus /usr/local/bin/crm- and then add
   a symlink /usr/bin/crm --> /usr/local/bin/crm- )

   the --mydir feature of mailfilter.crm is not yet implemented and won't
   be in V1.0. Expect it in V1.1

Other than that and a few documentation fixes, this version is identical to
20031217.
It's just the final sanity check before we do V1.0.

version 20031219-RC12.6, 2003-12-22
(notes from Joost van Baal, about autoconfiscated version)

 * Oops, configure.ac was looking for c++: this bug is fixed. Furthermore,
   ./configure now exits in absence of libtre. Cleaned up configure.ac:
   removed some bogus AC_CHECK_FUNCS checks, among others. *.mfp files now
   get installed in doc/crm114/examples/crmfilter/ .

version 20031219-RC12.5, 2003-12-21
(notes from Joost van Baal, about autoconfiscated version)

 * Split crm114.c into multiple source files, using Paolo P's scripts.

version 20031219-RC12.4, 2003-12-20
(notes from Joost van Baal, about autoconfiscated version)

 * Restructured tarball layout according to Paolo P's ideas. (Splitting
   sources still to do.)

version 20031219-RC12.3, 2003-12-19
(notes from Joost van Baal, about autoconfiscated version)

 * Ships and installs example .crm's too, as well as documentation. New
   layout of tarball: split stuff among directories.

version 20031219-RC12.2, 2003-12-19
(notes from Joost van Baal, about autoconfiscated version)

 * Now installs some docs.

version 20031219-RC12.1, 2003-12-19
(notes from Joost van Baal, about autoconfiscated version)

 * Autoconfiscated test release.

version 20031219-RC12, 2003-12-19

 * Release by Bill Yerazunis.

Version 20031215-RC11
(notes from Bill)

Minor bugs smashed. Math evaluation now works decently (but be nice to it).
Mailfilter accuracy is up past 99.9% (less than 1 error per thousand,
usually when a spammer joins a well-credentialed list and spams the list,
or a seldom-heard-from friend sends a one-line message with a URL wrapped
in HTML). Command line features for CAMRAM added ("--spamcss" and
"--nonspamcss"; these will probably become unified to a --mydir). Lots of
documentation updates; if it says something in the documentation, there's
actually a good chance it works as described.

version 20031129-RC11.1, 2003-12-18
(notes from Joost van Baal, about autoconfiscated version)

 * First test release of autoconfiscated branch.

version 20031129-RC11, 2003-11-29

 * Release by Bill Yerazunis.

Version 20031111-RC7
(notes from Bill)

More bugs smashed; there are still a few outstanding bugs here and there,
but you aren't likely to find them unless you're really pushing the limits.
Improvements are everywhere; you can now embed the classical C escape chars
in a var-expanded string (e.g. \n for a newline) as well as hex and octal
characters like \xFF and \o132. EVAL now can do string length and some RPN
arithmetic/comparisons; approximate regexing is now available by default,
and the command line input is improved.

Version 20031101-RC4 (November 1, 2003)
(notes from Bill)

The only changes this release are some edge-condition bugfixes (thanks to
Paolo and JSkud, among others) and the inclusion of Ville Laurikari's new
TRE 0.6.0-PRE3 regex module. This regex module is tres-cool because it
actually has a useful approximate matcher built right in, dovetailed into
the REGEX syntax for #-of-matches.

Consider the regex /aaa(foo){1,3}zzz/ . Between "aaa" and "zzz" this
matches "foo", "foofoo", or "foofoofoo". Cognitively, anything in a regex's
{} doesn't say what to match, just how to match it. The cognitive jump you
have to take here is that /foo{bar}/ can have a {bar} that says _how
accurately_ to match foo. For instance:

     foo{~}

finds the _closest_ match to "foo" (and it always succeeds). The full
details of approximate matching are in the quickref. Read and Enjoy.

(for your convenience, we also include the well-proven 0.5.3 TRE library,
so you should install ONE and ONLY one of these. Realize that 0.6.0-PRE3 is
still a fairly moderately tested library; install whichever one meets your
need to bleed. :-) )

version Oct 23, 2003 ( version 20031023-RC3 )
(notes from Bill)

Yes, we're now at RC3.
Changes are that EVAL now works right, lots of bugfixes, and the latent
code for RFC-compliant inoculation is now in the shipped mailfilter.crm
(but turned off in mailfilter.cf). All big changes are being deferred to
V1.1 now; this is bugfix city. Make it bleed, folks, make it _bleed_.

   -Bill Yerazunis

version October 15, 2003
(notes from Bill)

It's been a long road, but here it is - RC1, as in Release Candidate 1.
WINDOW and HASH have been made symmetrical, the polynomials have been
optimized, and it's ready. Accuracy is steady at around 3 nines. Because of
all the bugfixes, upgrading to this version (compatible with the BETA
series) is recommended.

   -Bill Yerazunis

Version This is the September 25th 2003 BETA-2
(notes from Bill)

What's new: a few dozen bugs stomped, and new functionality everywhere.
Command line args can now be restricted to acceptable sets; will keep your
.css files nicely trimmed; ISOLATE will copy preexisting captures;
--learnspam and --learnnonspam in mailfilter.crm will perform exactly the
same configured mucking as filtering would, and then learn; --stats_only
will generate ONLY the 'pR' value (this is mostly for CAMRAM users);
positional args will be assigned :_posN: variables; the kit has been split
so you don't have to download 8 megs of .css if you are building your .css
locally; and it's working well enough that this is a full BETA release.

Version August 07, 2003 bugfix release.
(notes from Bill)

Changes: lots and lots of bugfixes. Really. The only new code is
experimental code in mailfilter (to add 'append verbosity as attachment')
and getting WINDOW to work on any variable; everything else is bugstomping
or enhanced testing (megatest.sh runs a lot of tests automatically now).

There's still a bug or dozen out there, so keep sending me bug reports!
(and has anyone else done the cssutil --> cssmerge to build small .css
files for fast running?)

Version This is the July 23, 2003 alpha release.
(notes from Bill)

This release is a bugfix release for the July 20 secret release. Fixes
include: configuration toggles for allmail.txt and rejected_mail.txt, and
execution time profiling works (-p generates an execution time profile, -P
now limits number of statements in program).

Good news: the new .css file format seems to be working very well; although
we spend a little more time in .css evaluation, the accuracy increase is
well worth it (I've had _one_ error since 07-20, a false accept to a
mailing list that came back as "marginally nonspam" because the mailing
list is usually squeaky clean).

Merging works well; you can now make your .css files as big (or small) as
you dare (within reason; you'll need to throw away features if you want to
compress the heck out of it and you'll use lots of memory or page like
crazy if you make them too big). If experiment shows that this memory usage
is excessive, let me know and I'll see if I can do a
less-space-for-more-time tradeoff.

Profiling indicates that we spend more time in blacklist processing than in
the whole SBPH/BCR evaluator (which isn't that surprising, when you get
down to it), so maybe trimming the blacklist to people who spam _you_ would
be a good performance improvement.

Anyway, here you go; this is a _recommended_ release. Grab it and have fun.
:)

As usual, prior news and updates are at the end of this file.

Version 2003 July 19
(notes from Bill)

This is the July 19, 2003 SECRET alpha release. It won't be linked on the
webpage; the only people who will know about it are the ones who get this
email. Y'all are special, you know that? :-)

Since this is a SECRET release, you all have a "need to know". That need is
simple: I'd like to get a little more intense testing on this new setup
before I put it out for general release.

Enough has changed that you _need_ to read ALL the news before you go off
and install this version. Be AFRAID.
:)

LOTS of changes have occurred - the biggest being that the new, totally
incompatible but far better .css format has been implemented. The new
version has everything you all wanted, both for people who want huge .css
files, and for people who want _smaller_ .css files.

This new stuff has necessitated scouring cssutil and cssdiff so don't use
the old versions for the new format files.

Lastly, because the old bucket max was 255 and the new is 4 gigs, the
renormalization math changed a little. Expect pRs to be closer to 0 until
you train some more. Accuracy should be better, even _before_ training, so
overall it's a net win.

There's also string rewriting in the pre-classification stage (who wanted
that? Somebody did....) and since term rewriting is so darn useful, I'm
releasing an expurgated version of the string rewriter I use to scrub my
spam and nonspam text of words that should not be learned. This scrubber
automatically gets used if you "make cssfiles".

Here are the details:

1) The format of the .css files has changed drastically. What used to be a
   collisionful (and error-accepting) hash is now a 64-bit hash that is
   (probably) nearly error free, as it's also tagged with the full 64-bit
   feature value; if two values clash as to what bucket they would like to
   use, proper overflow techniques keep them from both using the same
   bucket.

   Bucket values were maxed at 255 (they were bytes); now they're 32-bit
   longs, so you are _highly_ _unlikely_ to max out a bucket.

   These two changes make things significantly more robust. These changes
   also make it possible (in fact, trivial) to resize (both upward and
   downward!), compress, optimize, and do other very useful things to .css
   files. Right now, the only supported operation is to _merge_ one .css
   file onto another... but the good news is that now these files can be of
   different sizes!

   So, the VERY good news is that you can look at your .css files with
   cssutil, decide if (or where) you want to zero out less significant
   data, and then use dd to create a blank, new outfile.css file that will
   be about half to 2/3 full, then use

        cssmerge outfile.css infile.css

   to merge your infile.css into the outfile.css. This will be a real help
   for people who have (or need) very large OR very small .css files. :)

   You can create the blank .css file with the command 'dd' as in:

        dd bs=12 count= if=/dev/zero of=mynew.css

   (the bs=12 is because the new feature buckets are 12 bytes long)

   Because chain overflowing is done "in table, in sequence" you can't have
   more features than your table has feature buckets. You'll get a
   trappable error if you try to exceed it.

   Minor nit: right now, feature bucket 0 is reserved for version info, but
   it's never used (left as all 0's). That's no major hassle, but
   just-so-you-know... :)

2) A major error in error trapping has been corrected. TRAPs can now nest
   at least vaguely correctly; a nonfatal trap that is bounced does not
   turn into a fatal. Also, the :_fault: variable is gone; each TRAP now
   specifies its own fault code. This isn't to say that error trapping is
   now perfect, but it's a darn sight better than it was before.

3) term rewriting on the matched or learned text is now supported; this
   will mean significant gains in out-of-the-box accuracy as well as
   keeping your mail gateway name from becoming a spam word. :)

   Far more fancy rewritings can be implemented, if you should choose. The
   rewriting rules are in rewrites.mfp - YOU must edit this to match your
   local and network mailer service configuration, so that your email
   address, email name, local email router, and local mail router IP all
   get mapped to the same strings as the ones I built the distribution .css
   files with.
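
The blank-file arithmetic from item 1 can be sketched in Python as follows.
This is an illustration only, not a CRM114 tool: the helper names
`blank_css_size` and `write_blank_css` are assumptions, and only the
12-byte bucket size comes from the notes above. The sketch writes an
all-zero file, which is what dd does when it copies from /dev/zero.

```python
BUCKET_BYTES = 12  # size of one feature bucket, as stated in item 1


def blank_css_size(bucket_count):
    """Return the size in bytes of a blank .css file with bucket_count buckets."""
    if bucket_count <= 0:
        raise ValueError("bucket_count must be positive")
    return bucket_count * BUCKET_BYTES


def write_blank_css(path, bucket_count):
    """Write bucket_count all-zero buckets to path and return the byte count.

    Equivalent in effect to copying bucket_count blocks of 12 bytes from
    /dev/zero; the caller chooses bucket_count.
    """
    size = blank_css_size(bucket_count)
    with open(path, "wb") as out:
        out.write(bytes(size))  # bytes(n) is n zero bytes
    return size
```

For example, `write_blank_css("mynew.css", 1048576)` would produce a file
of 12582912 zero bytes. The bucket count here is an arbitrary example
value, not a recommendation; pick it so the merged file ends up about half
to 2/3 full, as described above.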

4) Minor bugs - a minor bug (inaccurate edge on matching) for the
   polynomial; annoying segfault on insert files that ended with '#' that
   were immediately followed by a { in the main program was fixed.

5) a new utility is provided - rewriteutil.crm. This utility can do string
   rewriting for whatever purpose you need. I personally use it to "scrub"
   the spam and nonspam text files; the file scrub_mailfile_rewrites.mfp
   contains an (expurgated) set of rewrite rules that I use. You will need
   to edit scrub_mailfile_rewrites.mfp to put your account name and
   password in, otherwise you'll be using mine (and losing accuracy).

   For examples on the term rewriting, both in the mailfilter and in the
   standalone utility rewriteutil.crm, just look at the example/test code
   in rewritetest.crm (which uses the rewrite rules in test_rewrites.mfp)

Version July 1, 2003 alpha release.

This is a further major bugstomping release. The .css files are expanded to
8 megabytes to decrease the massive hash-clashing that has occurred. UNION
and INTERSECTION now work as described in the (updated) quickref.txt, with
the (:out:) var in parens and the [:in1: :in2: ...] vars in boxes. A major
bug in LEARN and CLASSIFY has been stomped; however this is a "sorta
incompatible" change and you are encouraged to rebuild your .css files with
a hundred Kbytes or so of prime-grade spam and nonspam (which has been
stored for you in spamtext.txt and nonspamtext.txt).

The included spam.css and nonspam.css files are already rebuilt for the
corrected bug in LEARN and CLASSIFY. These .css files are also completely
fresh and new; I restarted learning about a week ago and they're well into
the 99.5% accuracy range.

Version June 23, 2003 alpha release.

This is a major bugstomping release. and now seem to work more like they
are described to work. The backslash escapes now are cleaner; you may find
your programs work "differently" but it _should_ be backward compatible.
The preprocessor no longer inserts random carriage returns. A '\' at the
end of a line is a continuation onto the next line.

Mailfilter now can be configured for separate exit codes on "nonspam",
"spam" and "problem with the program". Exit codes on CRM114 itself have
been made more appropriate; compiler errors and untrapped fatal faults now
give an error exit code. Additionally, FAULT and TRAP are scrubbed, and the
documentation made more accurate.

June 10, 2003 news:

This new version implements the new FAULT / TRAP semantics, so user
programs can now do their own error catching and hopefully error fixups.
Incomplete statements are now flagged a (little bit) better. Texts are now
Base64-expanded and decommented before being learned. There's a bunch of
other bugfixes as well. Default window size is dropped to 8 megs, for
compatibility with HPUX (change this in crm114_config.h).

June 01, 2003 news:

   the ALIUS statement - provides if/then/else and switch/case capabilities
   to CRM114 programmers. See the example code in aliustest.crm to get some
   understanding of the ALIUS statement.

   the ISOLATE statement - now takes a /:*:initial: value / for the freshly
   isolated variable.

   Mailfilter.crm is now MUCH more configurable, including inserting
   X-CRM114-Status: headers and passthru modes for Procmail, configurable
   verbosity on statistics and expansions, inserting trigger 'ADV:' tags
   into the subject line, and other good integration stuff.

   Overall speed has improved significantly - mailfilter is now about four
   times FASTER than SpamAssassin with no loss of accuracy.

   bugfix - we now include Ville Laurikari's TRE regexlib version 0.5.3
   with CRM114; using it is still optional ("make experimental") but it's
   the recommended system if your inputs include NULL bytes.

   bugfix - OUTPUT to non-local files now goes where it claims, it should
   no longer be necessary to pad with a bunch of spaces.

   yet more additions to the .css files

April 7th, 2003 version:

0) We're now up to "beta test quality"... no more "alpha" quality level.
   This is good. :-)

1) As always, lots of bugfixes. And LOTS of thanks from all of you poor
   victims out there. We've reached critical mass to the point now where
   I'm even getting bug _fix_ suggestions; this is great!

   If you do make a bug report or a bugfix suggestion, please include not
   only the version of CRM114 you're running, but also the OS and version
   of that OS you're running. I've seen people porting CRM114 to Debian, to
   BSD, to Solaris, and even to VMS... so please let me know what you're
   running when you make a bug report. PLEASE PUT AT LEAST THE CRM114
   VERSION IN THE SUBJECT LINE.

2) We now have an even better 'mailfilter.crm' . Even with the highly
   evolved spam in the last couple of, we're still solidly above 99%
   (averaging around 99.5%). (it's clear that the evolution is due to the
   pressures brought by Bayesian filters like CRM114)... some of these new
   spams are very, VERY good. But we chomp 'em anyway. :-)

3) The new metaflag "--" in a CRM114 command line flags the demarcation
   between "flags for CRM114" and "flags for the user program to see as
   :_argN:". Command line arguments before the "--" are seen only by
   CRM114; arguments after the "--" are seen only by the user program.

4) EXPERIMENTAL DEPARTMENT: We now have better support for the 8-bit-clean,
   approximate-capable TRE regex engine. It's still experimental, but we
   now include TRE 0.5.1 directory in this kit; you can just go into that
   subdirectory, do a ./configure, a make, and a make install there, and
   you'll have the TRE regex engine installed onto your machine (you need
   to be root to do this).

   Then go back up to the main install directory, and do a "make
   experimental" to compile and install the experimental version as
   /usr/bin/crma (the 'a' is for 'approximate regex support').

   Using the experimental version 'crma' WILL NOT AFFECT the main-line
   version 'crm'; both can coexist without any problems.

   To use the approximate regex support (only in version 'crma') just add a
   second slashed string to the MATCH command. This string should contain
   four numbers, in the order SIMD (which every computer hacker should
   remember easily enough). The four integers are the:

        Substitution cost,
        Insertion cost,
        Maximum cost,
        Deletion cost

   in an approximate regex match. If you don't add the second
   slash-delimited string, you get ordinary matching.

   Example:

        match /foobar/ /1 1 1 1/

   means match for the string "foobar" with at most one substitution,
   insertion, or deletion.

   This syntax will eventually improve; like the makefile says, this is an
   experimental option. DO NOT ASSUME that this syntax will not change
   TOTALLY in the near future. DO NOT USE THIS for production code.

4) Yet further improvements to the debugger.

5) Further improvements to the classifier and the shipped .css files.

6) The "stats" variable in a CLASSIFY statement now gives you an extra
   value: the pR number. It's pR for the same reason pH is pH - it gives an
   easy way to express very large numeric ratios conveniently.

   The pR number is the log base 10 of the .css matchfile signal strength
   ratios; it typically ranges from +350 or so to -350 or so. If you're
   writing a system that uses CRM114 as a classifier, you should use pR as
   your decision criterion (as used by mailfilter.crm and classifymail.crm,
   pR values > 0 indicate nonspam, < 0 indicates spam).

   If you want to add a third classification, say "SPAM/UNSURE/NONSPAM",
   use something like pR > 100 for nonspam, between +100 and -100 for
   unsure, and < -100 for spam. CAMRAM users, take note. :)

6) The functionality of 'procmailfilter.crm' has been merged back into
   mailfilter.crm, classifymail.crm, learnspam.crm and learnnonspam.crm. Do
   NOT use the old "procmailfilter.crm" any more - it's buggy,
   booger-filled, and unsupported from now on.
   PLEASE PLEASE PLEASE don't use it, and if you have been using it, please
   stop now!

Jan 28th release news

Many thanks to all of you who sent in fixes, and taught me some nice
programming tricks on the side.

0) INCOMPATIBLE CHANGES:

   a) INCOMPATIBLE (but regularizing) change: Input was taken from the file
      [this-file.txt] but output went to (that-file.txt); this was a wart
      and is now fixed; INPUT and OUTPUT both now use the form of

           INPUT [the-file-in-boxes.txt]

      and

           OUTPUT [the-file-in-boxes.txt]

   b) INCOMPATIBLE (but often-requested) change: You don't need to say
      "#insert" any more. Now it's just ' insert ', with no '#' . Too many
      people were saying that #insert was bogus, and it was too easy to get
      it wrong. Now, insert looks like all other statements;

           insert yourfilenamehere.crm

   c) The gzip file no longer unpacks into "installdir", but into a
      directory named crm114- .

1) BUGFIXES: bugs stomped all over the place - debugger bugs (now the
   debugger doesn't go into lalaland if an error occurs in a batch file,
   infinite loop on bogus statements fixed, debugger "n" not doing the
   right thing), window statement cleaned and now works better, '\' now
   works correctly even in /match patterns/, default buffer length is now
   16 megabytes (!), the program source file is now opened readonly.

2) 8-BIT-CLEAN: code cleanups and reorganizations to make CRM114
   8-bit-cleaner. There may be bugs in this (may? MAY?) but it's a start.
   (note: you won't get much use of this unless you also turn on the TRE
   engine, see next item.)

3) REGEX ENGINES: the default regex engine is still GNU REGEX (which is not
   8-bit-clean) but we include the TRE regex engine as well (which is not
   only 8-bit-clean, but also does approximate regexes). TRE is still
   experimental; you will need to edit crm114_config.h to turn it on and
   then rebuild from sources. Do searches of www.freshmeat.net to see when
   the next rev of TRE comes out.
4) SUBPROCESSES: Spawned minion buffers are now set as a fraction of the data window size, so programs don't die on overlength buffers if they copy a full minion output buffer into a non-empty main data window. The default size is scaled to the size of the main data buffers, currently 1/8th of a data buffer. With the new default of a 16-meg allocate-on-the-fly data buffer, that means your subprocesses can spout up to 2 megs of data before you need to think about using asynchronous processes.

5) The debugger now talks to your tty even if you've redirected stdin to come from a data file. EOF on the controlling tty exits the program, so -d nnnn sets an upper limit on the number of cycles an unattended batch process will run before it exits. (This was added because I totally hosed my mailserver with an infinite loop. Quite the "learning experience", but I advise against it.)

6) An improved tokenizer for mail filtering. You can pick any of

7) Option for exit codes for easy Procmail integration, so the old "procmailfilter.crm" file goes away; it's no longer necessary to have that code fork.

8) For those of you who want easier integration with your local mail delivery software, without all the hassle of configuring mailfilter.crm, there are three new very bare-bones programs, meant to be called from Procmail. These do NOT use the blacklist or whitelist files, nor can they be remotely commanded like the full mailfilter.crm:

    learnspam.crm
    learnnonspam.crm
    classifymail.crm

* learnspam.crm < some-spam.txt will learn that spam into your current spam.css database. Old spam stays there, so this is an "incremental" learn.

* learnnonspam.crm < some-non-spam.txt will learn that nonspam into your current nonspam.css database. Old nonspam stays there, so this is an "incremental" learn.

* classifymail.crm < mail-message.txt will do basic classification of text.
This code doesn't do all the advanced things that mailfilter.crm does, like base-64 armor-piercing or html comment removal, and so it isn't as accurate, but it's easier to understand how to set it up and use it. Classifymail.crm returns a 0 exit code on nonspam, and a 1 exit code on spam. Simple, eh? Classifymail does NOT return the full text of the message; you need to get that another way (or modify classifymail.crm to output it - just put an "accept" statement right before the two "output ..." statements and you'll get the full incoming text, unaltered).

November 26, 2002:

NEW Built-in Debugger - the "-d" flag at the end of the command line puts you into a line-oriented high-level debugger for CRM114 programs.

Improved Classifier - the new classifier math is giving me > 99.92% accuracy (N+1 scaling). In other words, once the classifier is trained to your errors, you should see fewer than one spam per thousand sneak through.

Bug fixes - the code base now should compile more cleanly on newer systems that have IEEE float.h defined.

Security fix - a non-exploitable buffer overflow fixed.

Documentation fixes - serious doc errors were fixed.

Nov 8th 2002 version

*) Procmail users: a version of mailfilter.crm specifically set up for calling from inside procmail is included - see the file "procmailfilter.crm" for the filter, and "procmailrc.recipe" for an example recipe of how to call it. (courtesy Craig Hagan)

*) Bayesian Chain Rule implemented - scoring is now done in a much more mathematically well-founded way. Because of this, you may see some retraining required, but it shouldn't be a lot. Users that couldn't use my pre-supplied .css files should delete the supplied .css files and retrain from their own spamtext.txt and nonspamtext.txt files.

*) Classifier polynomial calculation has been improved but is compatible with previous .css files.

*) -s will let you change the default size for creating new .css files (needed only if you have HUGE training sets.)
Rule of thumb: the .css files should be at least 4x the size of the training set.

*) Multiple .css files will now combine correctly - that is, if you have categorized your mail into more than "spam" and "nonspam", it now works correctly. Ex: You might create categories "beer", "flames", "rants", "kernel", "parties", and "spam", and all of these categories will plug-and-play together in a reasonable way.

*) Speed and correctness improvements - some previously fatal errors can now be corrected automagically.

Oct 31 2002

Bayesian Chain Rule implemented - scoring is now done in a much more mathematically well-founded way. Because of this, you may see some retraining required, but it shouldn't be a lot. Users that couldn't use my pre-supplied .css files should delete the supplied .css files and retrain from their own spamtext.txt and nonspamtext.txt files.

Classifier polynomial calculation has been improved but is compatible with previous .css files.

-s will let you change the default size for creating new .css files (needed only if you have HUGE training sets.) Rule of thumb: the .css files should be at least 4x the size of the training set.

Multiple .css files will now combine correctly - that is, if you have categorized your mail into more than "spam" and "nonspam", it now works correctly. Ex: You might create categories "beer", "flames", "rants", "kernel", "parties", and "spam", and all of these categories will plug-and-play together in a reasonable way, e.g.

    classify (flames.css rants.css spam.css | beer.css parties.css kernel.css)

will split out flames, rants, and spam from beer, parties, and linux-kernel email. (I don't supply .css files for anything but spam and nonspam, though.)

Lastly, there are some new speed and correctness improvements - some previously fatal errors can now be corrected automagically.
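
The 4x rule of thumb above is simple arithmetic, and a small sketch may make it easier to apply. The Python below is only an illustration: the helper name min_css_bytes is made up for this example, and it works purely in bytes. These notes do not say what units -s expects, so treat the result as a size estimate and not as a ready-made -s argument.

```python
import os
import sys


def min_css_bytes(training_bytes, factor=4):
    """Rule-of-thumb minimum .css size for one training set.

    Hypothetical helper (not part of CRM114): multiplies the size of
    the training text, in bytes, by `factor` (4 per the rule of thumb).
    """
    return factor * training_bytes


if __name__ == "__main__":
    # Usage: python css_size.py spamtext.txt nonspamtext.txt
    for path in sys.argv[1:]:
        size = os.path.getsize(path)
        print(path, size, "->", min_css_bytes(size))
```

For a 3,000,000-byte spamtext.txt this suggests a spam.css of at least 12,000,000 bytes.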
Oct 21 2002

Improvements everywhere - a new symmetric declensional parser, a much more powerful and accurate sparse binary polynomial hash system (sadly, incompatible - if you LEARNed new data into the .css files, you must use learntest.crm to LEARN the new data into the new .css files, as the old files used a less effective polynomial). Also, many bugfixes including buffer overflows fixed, -u to change user, -e to ignore environment variables, optional [:domain:] restrictions allowed on LEARN and CLASSIFY, status output on CLASSIFY, and exit return codes. Grotty code has been removed, the Remote LEARN invocation is now cleaned up, and CSSUTIL has been scrubbed up.

Oct 5 2002

Craig Rowland points out a possible buffer exploit - it's been fixed. In the process, the -w flag now boosts all intermediate calculation text buffers as well, so you can do some big big things without blowing the gaskets. :)