The CRM114 and CRM114 Mailfilter FAQ


*** What does CRM114 stand for?

- CRM stands for Controllable Regex Mutilator, concept # 114.  It's a
mutilating regex engine, designed to slice and dice text with the
vigor of a Cuisinart on an overripe zucchini.   There is no truth
to the rumor that it really means "Crash's Regex Monstrosity".


*** Very funny.  What does CRM114 _really_ stand for?

- CRM114, or more accurately, "the CRM114 Discriminator" is from the
Stanley Kubrick movie "Dr. Strangelove".  (an _excellent_ movie- you
should go get it and watch it.  Really.  Some critics have said it is
the greatest movie of it's era; others are more accurate and call it
the greatest movie _ever_ made.  In my opinion, a hundred years from
now, Dr. Strangelove will be considered the _definitive_ satirical
history of the Cold War and perhaps of the second half of the 20th
Century).  But I digress...

Anyway, the "CRM114 Discriminator" in the movie is a fictional
accessory for a radio reciever that's "designed not to recieve _at
all_", that is, unless the message is properly authenticated.  That
was the original goal of CRM114 - to discriminate between authentic
messages of importance, and get rid of the rest.

Note the emphasis on "get rid of".  Unlike many other "filters", 
CRM114's default action is to read all of input, and put NOTHING onto
output.  The simplest possible CRM114 program does exactly that, 
in fact- read all of stdin, and throw it away.


*** I've got a bug!  What do I do?

First, read the _whole_ HOWTO, which will tell you a few useful
tricks and diagnostics.  You can +probably+ fix the bug yourself. 

Then, read the rest of this document (the FAQ).  

If it's not clear at this point, and you _really_ have read both
the FAQ and the HOWTO, then you have a choice:

    1) Smart computer people will then _also_ read the QUICKREF.txt file
    to understand the CRM114 language, and then the INTRO.txt file
    to see how it all works.  Then they might try debugging
    themselves.

    2) Less computer-savvy types at this point might want to 
    put their question onto the crm114 mailing list.  Hint: the
    location of the mailing list is hidden somewhere in the
    documents available from the CRM114 webpage, which includes
    the above - the HOWTO, the FAQ, the QUICKREF, and the INTRO.  

So, there's a good reason to read the docs.  :-)


*** This is a BIG bug!  I got it to SEGFAULT!

You should still read this document, the HOWTO, the QUICKREF, and 
known-bugs.txt, as the bug may already be known and a workaround
developed (or it may be a problem with some external system that's
also known). 

Of course, if you've managed to SEGFAULT the CRM114 system, then
try to reduce your program and data right down to the minimum
needed (if you're using an unaltered mailfilter.crm, then don't worry
about this).  Then please let us know on the main CRM114 general list.

IF YOU POST A BUG REPORT, PLEASE INCLUDE THE FOLLOWING:

* - What version of CRM114 you are running (find out by typing "crm -v",
  or by turning on "add headers" and looking at the X-CRM114-Version
  header.

* - What version (if any) of mailfilter.crm you're using (if you have
  "add headers" turned on, which it is by default, you will have this
  as a checksum in the X-CRM114-Version header as "MF-something"

* - Any other details that might be pertinient, like how you invoke
  CRM114 (via procmail, via .forward, etc).


*** I'm training my CRM114 install.  When I mail mistakes back to
    myself to retrain, I was wondering which headers to include and
    where to place the retrain command?

The basic rule is to make the stuff after the COMMAND look as much
like the original misclassified mail as possible.  Since you
can configure CRM114 to do things to the subject, the header, and the
body, you have to _undo_ that stuff (design flaw there?  Maybe!)

What you should strive for is an email "forward" below the COMMAND
line that looks _exactly_ like the mail looked before it got to CRM114
in the first place, with all the headers there and intact.

Expand the headers fully - then remove all of the headers that
CRM114 inserted.  Then check the text; down a ways, you may find
a CRM114 "statistics" section - remove that.  You may also find an
"extra stuff" section, remove that too.

Then put the COMMAND line right before the first of the expanded headers.

Then, before you hit <send>, check your work; you can cost yourself
considerable accuracy by training in the wrong things!


*** What do these wierd version names mean?

The version _number_ of a CRM114 release is the year-month-day at
which that version went into testing.  For example, 20031225 means
year 2003, month 12, day 25, or Christmas Day, 2003.  This makes it
easy to see how old a version of CRM114 is.  As open-source software
revs very quickly to fix bugs and incorporate improvements, if you
have a version more than a month or two old you probably are
running obsolete software.

The -Blame<somebody> is an easier way to refer to a version; it
reflects one or more of:
	 1) someone who sent in a massive or important patch that
	    fixed a big problem
	 2) someone who pointed out a big problem, thereby motivating
	    me to fix it
         3) someone or something that either motivated me to get some
	    work done, or impeded that work, by means unstated (but
	    you can often guess... I'll give you a hint- the sushi
	    waitresses did _not_ send in a patch).

Generally speaking, it is an _honor_ to be blamed for a particular
release of CRM114, and recipients of that honor should wear it 
proudly.  :-)


*** I've got a _ton_ of spam in my library.  Why shouldn't I just 
    load it into CRM114 and get a head start on training?

Bad idea.  CRM114's learning algorithm is predicated on using the
TOE strategy - that is, Train Only Errors.  When CRM114's mailfilter
makes a mistake, you train in the right result and it will do better
next time.  

I've tested this _exhaustively_, spending a few CPU-weeks in the
process; CRM114 really does work best if you train in only errors, and
in the order encountered.  It's about a factor of two times more
accurate, and about a factor of two times faster during the execution.

The actual numbers work out something like this: I used a torture-test
corpus of 4147 messages, split roughly 60/40 between nonspam and spam.
Running TOE, with the 5th order polynomial and entropic correction,
the error rate curve showed a nice exponential approach to zero errors.
Reshuffling the corpus of 4147 messages ten different ways, the final 
error rate (that is, the error rate in the last 500 messages) was just
about 6.9 errors per 500 final messages, or 1.3% (very good on such
a difficult corpus.  I _personally_, when hand-scoring these messages,
get about a 30% error rate).

Training _every_ sample yielded about 14.9 errors in the final 500 
messages, or an error rate of about 2.9 %.  Interestingly, the error
curve or training every sample dove more quickly initially, but 
then _rose_ again as new items were trained.

The relative runtimes were 14 minutes (roughly) for TOE and training
only errors, versus about 29 minutes (roughly) for training everything,
averaged across the 10 runs of 4147 messages each.

So, if you don't mind being something more than a factor of 2 less
accurate, and twice as slow, you can go ahead and train everything.
Seriously- if you want accuracy, start from an empty .css file and
train only errors, as you encounter them.


***  But WHY does it work better to train only errors ?

- Intuitively, here's how you can understand it:

If you train in only on an error, that's close to the minimal change
necessary to obtain correct behavior from the filter.  

If you train in something that would have been classified correctly
anyway, you have now set up a prejudice (an inappropriately strong
reaction) to that particular text.

Now, that prejudice will make it _harder_ to re-learn correct behavior on
the next piece of text that isn't right.  Instead of just learning
the correct behavior, we first have to unlearn the prejudice, and
_then_ learn the correct behavior.  It can be done- but it doesn't
converge on the right answer as fast as never getting these unwarranted
prejudices in the first place.

In filters as in people, prejudices are generalizations that are best
avoided.  

There is a secondary effect as well, due to the limits to growth of
the .css files.  If you train everything, you will typically start
seeing CRM114 go into microgrooming at around a megabyte of text.
This is because there is a limited amount of space in a growth-limited
.css file.  When you reach this point, for every new feature added, at
least one old feature must be forgotten.  This loss of information is
a mixed blessing- although useful information is now being lost, old
prejudices are also being forgotten.  This slow tracking allows even
an aging, saturated CRM114 system data base to adapt to an evolving
spam stream.


*** Why are the bucket files called .css files?  They aren't cascading
    style sheets.

The .css suffix for SBPH bucket files originally stood for CRM Sparse
Spectra, until it was pointed out by a colleague that "sparse spectra"
was actually taken by another related but different method.  The name
stuck, even though it was no longer strictly accurate.


*** How accurate is CRM114 for anti-spam filtering?

Depending on your spam/nonspam mix, _very_ accurate.  I regularly
clock over 99%; I've had months where it was over 99.9%.

DON'T expect this level of performance without training on your
errors for a week or more.

Also note: spam _evolves_.  A filter that was perfect in June may
be making errors by December as spam topics change and attacks
vary.  Don't feel bad if you have to retrain.


*** It was working fine, then I trained one thing and it started making 
    mistakes again!  Did I break it?

Ah, you've encountered what we've termed an "error shower", (or, depending
on topic, a Porn Storm).  What's happened is that your filter was just on the
verge of accuracy; it made an error, you retrained it, but the retrain went
too far.

Don't worry.  Keep training, and the error shower will damp out and you'll
quickly converge on an even more accurate filter.


*** Why did you make the CRM114 language so weird?

- Because I had some ideas about how I thought a "filter language"
should be, and wanted to see how they worked out in practice.  I had a
bad experience with PERL, so I wanted a language where everything was
easy to understand, where the actions of a particular statement could
always be determined without referring to ANY other statement, let
alone "magic mind reading" and "action-at-a-distance"...  I probably
would do it differently now that I've done it this way.


*** So, is CRM114 a mailfilter, or what?

- No, CRM114 is actually a language that makes it easy to write filters
of any sort.  The most useful of these so far is for mail filtering;
the CRM114 distribution pack contains a pretty reasonable mail filter
for people who want it to "just work".  

Other people have written Usenet filters, Web content filters, and
(in a spree of creative hackery) a "cheater seeker" to find people
who were playing multiple users in a competitive email-based roleplay
game (and, by violating the one-user-one-player-character rule) gaining
an unfair advantage over the other game players.


*** What algorithm does the mailfilter use?

- There's a whole file that just describes it ("classify_details.txt")
in the distribution, but in short, it matches short phrases from the
incoming text with phrases you supplied previously as example text.
In reality, it does a lot of hashing and polynomials to make the run
time acceptable.

I call the filtering algorithm Sparse Binary Polynomial Hashing with
Bayesian Chain Rule evaluation (SBPH/BCR), which gives you a vague
idea of how it might work inside.

Note that in CRM114's included Mailfilter.crm, we do NOT do "special
tagging", such as creating special tokens saying "This was in the
Subject" or "This was in the Recieved header chain".  The short-phrase
sliding window is long enough that such tokens aren't necessary.

Minor Update- by altering the weightings of different lengths of short
phrases, it's possible to change the behavior of SBPH/BCR from a
strict Bayesian, to an entropically-corrected Bayesian, to a
Markovian matcher.   Releases since roughly 20040101 have all had this
improved Markovian matcher as the default configuration as this 
has been tested and demonstrated to provide the best performance.


*** So that's it?

- Mathematically, yes.  But since about 2003-11-xx, the chain 
rule function has been updated with entropic correction; this puts 
more weight on longer chains.  

In effect, this is a Markov model of the data stream with lots
of hidden states.  So, SBPH/BCR is really not SBPH/BCR, its more
of Sparse Binary Polynomial Hashing / Bayesian-Markov Model (SBPH/BMM).

The really nice thing about SBPH/BMM is that it's slightly more
accurate than the previous SBPH/BCR and it's 100% upward compatible
with /BCR data files.  All the information was there, it just needed
to be used properly.  


*** Why didn't you just use Bayesian filtering?

- I had played with single-word Bayesian filtering from '96 through
2000 and found that it could behave very well on very large input
texts (typically, tens to hundreds of megabytes).  But first brutally
naive implementation was far too memory-intensive to use for real
filtering; Paul Graham and others have refined Bayesian filtering to
the point where it's actually very useful for large numbers of people
to use (by clipping the less significant words).

In short, I didn't think that Bayesian filtering would work as well as
it does; I was wrong.  So, I tried a different idea and it seems to
work pretty well too.  The two methods are closely related; SBPH/BCR
with a polynomial of order 1 (that is, phrase length == 1 word) is
completely equivalent to 1-word Bayesian filtering without
insignificant-word and hapax clipping.

(addendum: as of the Nov 5th 2002 edition of CRM114, the classifier
does indeed do full Bayesian matching on these polynomial features.
This improves accuracy out into the >99.9% region, and
December-2003-onward versions default to use Markov weighting as well,
which gives somewhat better accuracy than entropically corrected
Bayesian weighting. )


*** Can I use the CRM114 mailfilter from inside PROCMAIL?

- Yes.  You'll want to edit the file mailfilter.crm to change the
actions from "accept" to "exit /0/" when the mail is good, and
from mailing your spambucket account with "syscall ..." to an "exit /1/"
when the mail is spam.  But yes, you can.


*** It's making too many mistakes!  What did I do wrong?

- You probably didn't do anything wrong.  What's probably happened
is that your spam/nonspam mix is very different than mine.  This causes
the words and phrases in your spam/nonspam to not match up with the
words/phrases in mine.

The fix is to train the mailfilter anytime it makes an error.  The
filter learns very fast; you should see drastic improvement after a
single day of error feedback.  I usually pass 99% accuracy at two or
three days, starting from zero.

In extreme cases, delete the pre-generated spam.css and nonspam.css
files, and start from scratch with the training.  In one day, (and
assuming sufficient spam and nonspam) you should be around 97%, two
days 98%, and three days > 99%.


*** How much data does it take to get that accurate?

- Not a lot.  At 99.67% accuracy, I only had 84K of nonspam and 185K of
spam text.  Interestingly, because spam contains a lot of run-on HTML,
the total number of hashed datapoint features is roughly equal.
 

*** I tried training in a huge amount of spam or nonspam, and it hung!

- Actually, it probably didn't hang.  Training is slow, only 10K or
so per second, so a half-meg spam bucket may take on the order of
a few minutes to train in.  Give it time.  :)


*** I trained in (some huge amount) of spam and nonspam, and it
    doesn't work any more!!!	    

- As noted above, you can overflow the buckets in the .css files
if you train in too much spam or nonspam.  You should get very good
results with less than 100K each of spam and nonspam text
(roughly equal numbers of messages is good too).  

Use the most recent spam and nonspam you can get your hands on.
Don't use spam more than a few months old for training.

And realize, if you're doing any "bulk training", rather than
Train Only Errors, that you could be doing 2x _better_ if you
trained only errors.  So there.  :)


*** Does CRM114 or the mailfilter work for any language besides English?

CRM114 uses 8-bit ASCII characters, and is 8-bit clean except for NULL
string terminations (which are forced by the GNU REGEX library, not my
decision).  I you use the included (and defaulted) TRE regex engine instead,
it's a NULL-clean system and you should be OK for 8-bit languages.

BUT if you use a unicode-based or other wide-character language,
you'll need to port up CRM114 to use wchar instead of char, as well as
getting unicode-clean regex libraries (there is a version of TRE that
does that, nicely enough).  This is not a minor undertaking, but if
you do it, please let me know and I'd gladly roll your changes back
into the standard CRM114 kit.

That said, if you get _mail_ in any language other than English, there
are two possibilities.  If you're lucky, you use a language that fits
in 8-bit characters.  In that case, you can just delete the spam.css
and nonspam.css files, and re-train the mailfilter for your local spam
mix.  Otherwise, you're stuck with wchars, so see above.

(Note: new versions of CRM114 since August 2003 default to use the TRE
library, which is both 8-bit-safe and has fewer edge errors than the
GNU library.  The GNU-based version remains available as a Makefile
option for those who depend on the GNU idiosyncracies.)


*** Why is LEARNing or CLASSIFYing so slow?

- It's not _that_ slow.  In fact, it's really quite fast.  With a
(relatively slow) Transmeta Crusoe 666MHz and a slow laptop disk,
CRM114 can "learn" at about 10kbytes/second, and can classify text
about twice as fast (20Kbytes/second), which compares very favorably with
genetic algorithms or neural nets.  Of course, that assumes that the
.css file is already in the UMB's (a reasonable assumption); if
they're not, add a reasonable amount of time for disk I/O to page in
the needed bits.

Note that this is still quite a bit slower than things like Paul
Graham's stuff, or Eric Raymond's Bogofilter.  It's a trade-off; 
that's what Open Source is all about.

Note that because LEARNing and CLASSIFYing do a lot of very randomized
accesses into the bucket files, these two verbs will thrash cache
pretty intensely.  I've had reports that 16MB bucket files will learn
or classify at horrendously slow rates- the results are still correct
and accurate, but it's very annoying.  We have a workaround plan (to
do sorted access, or use a tree sructure) in consideration.

(Note: in the ever-improving scheme of Free Software, the speed of
CRM114 has been continuously improved; we are now several times FASTER
than SpamAssasin.  But I suspect the SpamAssassin folks are not going
to take this challenge lying down.)


*** Why is CRM114 such a memory pig?

- It's not _that piggish_.  To keep speed up, the CRM114 engine
preallocates five buffers for data; each buffer is the size of a data
window (default 8 megabytes each, change it with the -w option).
Small buffers are allocated dynamically on the stack; expect to see
50K or so there.  LEARN and CLASSIFY use mmap to access the .css files
as part of virtual memory, so each .css file will consume a fair
amount of virtual memory (by default, 24 megabytes per .css file, but
this is released as soon as it's no longer needed, and since it's
mmaped rather than malloced, it does not require paging file or
swapfile space).  Also, since mmap does I/O through the fast paging
system rather than the file IO system, it runs VERY fast.


*** Aren't you afraid spammers will dissect CRM114 in order to beat it?

- Not really.  The basis of the LEARN/CLASSIFY filter is to look at
significant phrases in human language.  At least in English, there are
relatively few "natural" phrases one can use to sell Viagra, porn, or
low-interest mortgages.  So, a spammer trying to beat CRM114 would
have to avoid those phrases, and instead use phrases used in normal
non-sales-pitch discourse.

The cool part is that the non-sales-pitch discourse has no way to
express the sales pitch!  The medium cannot carry the message, there's
just no way to say it.  So the spammers are simply unable to function.


*** That sounds awfully close to 1984 and Newspeak.

- Yes, I realize this, and _yes_, it bothers me.  CRM114 could provide a
uniquely powerful tool for censorship.  But from what I can tell from
the public literature, the concept of phrase analysis is nothing new
to the CIA or the NSA.


*** Why can't you give me your sample spam and nonspam files?

- I can't give the text out because I don't own the copyright on it!
Spam text often has a copyright notice at the bottom, and nonspam text
(stuff my friends/cow-orkers/etc send me) is clearly copyright _them_,
not _me_.  So, it would be a gross breach of confidence at the very
least, if not an outright violation of any reasonable copyright law,
for me to distribute that text.

Fear not, you don't _have_ to trust my "magic files" to not contain
a hidden agenda.  You can rebuild the .css files with your own
spamtext.txt and nonspamtext.txt files easily.  Just delete *.css and
then create two files of spam and nonspam "spamtext.txt" and
"nonspamtext.txt".  Run the "make cssfiles" command and new .css files
will be built.

Even better, delete the .css files, type 

     cssutil -b -s spam.css
     cssutil -b -s nonspam.css

and train only errors for a few days; you'll end up with a highly accurate
filter that matches exactly the kind of mail you get, and the kind
of spam you get.


------ OLD, OBSOLETE QUESTIONS -------

***  When will CRM114 go to full Bayesian?

As of Nov 1 2002, it has.  :-)  See the file "classify_details.txt" for 
the full scoop.

We may change the Bayesian Chain Rule at some point in the future; the
reason is that the standard Bayesian Chain Rule (BCR) has an
underlying assumption of statistical independence on the input events.
Unfortunately, spam features and nonspam features are NOT independent
and so BCR is really quite incorrect to use.  I'm working on better
alternatives and they will appear as they are found, tested, and
proven to work better than BCR.