How CRM114's LEARN and CLASSIFY really work.

This document describes the internal workings of the CRM114 LEARN and
CLASSIFY functions. You do _not_ need to know this to use CRM114
effectively; this is to satisfy the curiosity of those who really want
deep knowledge of the tools they use.

(NOTE: since CRM114 now has multiple classifiers available, please read
this whole document. Some of the classifiers are interoperable, and
some are not.)

The current distribution builds in a set of _four_ classifiers (item 5
below covers the shared file formats rather than a classifier). The
classifiers are:

1) SBPH Markovian (the default) - This is an extension of Bayesian
classification, mapping features in the input text into a Markov
Random Field. This turns each token in the input into 2^(N-1)
features, which gives high accuracy but at high computation and memory
cost.

2) OSB Markovian - This is a version of the Markovian that uses an
orthogonal sparse bigram (OSB) feature set instead of the SBPH
features. OSB seems to be neck-and-neck in accuracy versus SBPH, but
it's considerably faster and uses less memory for the same amount of
detail. Because OSB Markovian is a subset of SBPH Markovian, you can
"sort of" intermix .css files generated by SBPH Markovian and OSB
Markovian, although there will be some loss of accuracy. Fidelis Assis
contributed the idea of using OSB instead of the full SBPH feature set
and showed that OSB actually had advantages.

3) OSB Winnow - This classifier uses the same feature set as OSB
Markovian but doesn't use a probabilistic estimation at all. Instead,
it uses the Winnow algorithm. The data files aren't compatible,
although a good hacker could probably come up with a way to get an
approximate conversion back and forth from the Markovian models.

4) Correlator classification - This classifier doesn't do tokenization
at all. Instead, it slides the example and unknown texts across each
other and measures the cross-correlation. The final scores go with the
square of the run-lengths of matching strings. This matcher is -very-
slow... easily 40 to 100x slower than any of the other classifiers. It
_will_ work against binary files, though, which none of the other
classifiers will.

5) Format of the .css and .cow files and microgrooming - some design
notes and how microgrooming works.

Here are the details for each classifier:

Classifier 1: Markovian

The general concept is this: break the incoming text into short
phrases of from one to five words each. A phrase can have words in the
middle of the phrase skipped (e.g. "BUY ONLINE NOW!!!" is always a bad
sign), and more than one phrase can use the same word. You can't
change the order of the words, but you _can_ bridge across newlines,
punctuation, etc. Make all the phrases you can make.

For each phrase you can make, keep track of how many times you see
that phrase in both the spam and nonspam categories. When you need to
classify some text, make the phrases, and count up how many times all
of the phrases appear in the two different categories. The category
with the most phrase matches wins.

Note that you never have to cross-reference between the two category
phrase sets. If a phrase appears in both categories an equal number of
times, then both categories get an equal score boost. Since an equal
score boost doesn't change which category will win, there's no need to
cross-reference category phrase counts.
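
To make the phrase-building concrete, here is a minimal sketch in C
(illustrative only, not the CRM114 source) that prints every phrase
one five-word window generates. The first word of the window is kept
as the always-present anchor, matching the "the ..." feature examples
later in this document; the skipped positions are shown explicitly
because the real features are position-sensitive:

    #include <stdio.h>

    int main(void)
    {
        /* one full five-word window; in practice this window slides
           across the tokenized text one word at a time */
        const char *win[5] = {"the", "quick", "brown", "fox", "jumped"};
        unsigned mask;
        int i;

        /* the anchor word is always kept; the other four positions
           are each kept or skipped, so each full window yields
           2^4 = 16 phrases of one to five words */
        for (mask = 0; mask < 16; mask++) {
            printf("%s", win[0]);
            for (i = 1; i < 5; i++)
                printf(" %s",
                       (mask & (1u << (i - 1))) ? win[i] : "<skip>");
            printf("\n");
        }
        return 0;
    }

The 16 phrases per window are where the 2^(N-1) feature count quoted
above comes from (N = 5 here).
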
NB: This process is called "sparse binary polynomial hashing" because
it uses a set of polynomials to generate a hash-of-hashes; sparse
because not all words are represented by nonzero terms, binary because
the changing coefficient terms are always either 0 or 1, and a hash
because, well, it's a hash. :)

Instead of simply comparing raw count scores, we do a Bayesian
chain-rule calculation to find the probability of "good" versus
"evil". (CRM114 actually has no knowledge of spam and nonspam, just
two sets of user-defined classes that can be whatever you want them to
be. This explanation will use 'spam' and 'nonspam', but internally
it's just "these statistics files here" and "those statistics files
there".)

The Bayesian chain-rule formula is

                        P(A|S) P(S)
    P(S|A) = -------------------------------
              P(A|S) P(S)  +  P(A|NS) P(NS)

which (in words) says: "The NEW chance of spam, given some feature A,
is equal to the chance of A given spam times the OLD chance that it's
spam, divided by the sum of the chance of A given spam times the old
chance it's spam, plus the chance of A given nonspam times the old
chance it's nonspam."

We start by assuming that the chance of spam is 50/50. We count up the
total number of features in the "good" versus "evil" feature .css
files. We use these counts to normalize the chances of good versus
evil features, so if your training sets are mostly "good", it doesn't
predispose the filter to think that everything is good.

We repeatedly form a feature with the polynomials, check the .css
files to see what the counts of that feature are for spam and nonspam,
and use the counts to calculate P(A|S) and P(A|NS) [remember, we
correct for the fact that we may have different total counts in the
spam and nonspam categories]. We also bound P(A|S) and P(A|NS) to
prevent any 0.0 or 1.0 probabilities from saturating the system. If
you allow even _one_ 0.0 or 1.0 into the chain rule, there's no way
for the system to recover, even in the face of overwhelming evidence
to the contrary. The actual bound in use depends on the total number
of counts of the feature A ever encountered, irrespective of their
good/evil nature.

[additional note: versions from 20030630 to 20031200 used a fairly
gentle method to generate the local probabilities from the relative
hit counts. From 20031200 onward, this local probability was modified
by the number and sequence of the terms of the polynomial. The best
model found so far is a set of coefficients that model a Markov chain;
polynomials that have a longer chain length (and therefore a closer
match) get a significantly higher boost.]

Once we have P(A|S) and P(A|NS), we can calculate the new P(S) and
P(NS). Then we get the next feature out of the polynomial hash
pipeline (each extra word makes 15 features) and repeat until we hit
the end of the text. Whichever set has the greater probability wins.

We also take multiple files on each side of the '|' in the CLASSIFY
statement AS A GROUP, so it's as though we added the corresponding
hash buckets together for everything on the left of the '|' and
everything on the right.
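
As a concrete illustration, here is one chain-rule step in C. This is
a minimal sketch, not the CRM114 source; in particular, the
(hits+1)/(total+2) bounding shown here is just one plausible choice -
CRM114's actual bounding rules are described in the post-Nov-2002
notes further below.

    /* one Bayesian chain-rule step for a single feature A;
       returns the updated P(S) */
    double chain_rule_step(double p_s,     /* current P(S)            */
                           double hits_s,  /* count of A in the spam
                                              .css file, normalized
                                              for corpus size         */
                           double hits_ns) /* same, for nonspam       */
    {
        double total = hits_s + hits_ns;
        /* local probabilities, bounded away from 0.0 and 1.0 so
           that one feature can never saturate the chain */
        double p_a_s  = (hits_s  + 1.0) / (total + 2.0);
        double p_a_ns = (hits_ns + 1.0) / (total + 2.0);
        double num    = p_a_s * p_s;

        return num / (num + p_a_ns * (1.0 - p_s));
    }
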
-----

Now, on to the brutish details for the Markovian classifier:

In terms of the actual implementation, LEARN and CLASSIFY are
pipelined operations. The pipeline has these stages (as of the
2002-10-21 version):

1) Tokenization. The input text is tokenized with the supplied regex
(usually [[:graph:]]+ ) into a series of disjoint word tokens.

2) Each word token is hashed separately. The hash used is a "fast
hash", not particularly secure, but with reasonably good statistics.

3) Each hash is pushed into the end of a five-stage pipeline. Each
value previously pushed moves down one level in the pipeline.

4) The pipeline stages are tapped to supply values H0 through H4 that
will be multiplied by the particular polynomial's coefficients (H4
being the newest value).

5) After each value is pushed into the hash pipeline, the full set of
polynomials is evaluated. These polynomials have changed over various
releases, but as of 2002-10-23 the coefficients are:

    poly# \ for:   H4   H3   H2   H1   H0

        1           0    0    0    0    1
        2           0    0    0    3    1
        3           0    0    5    0    1
        4           0    0    5    3    1
        5           0    9    0    0    1
        6           0    9    0    3    1
        7           0    9    5    0    1
        8           0    9    5    3    1
        9          17    0    0    0    1
       10          17    0    0    3    1
       11          17    0    5    0    1
       12          17    0    5    3    1
       13          17    9    0    0    1
       14          17    9    0    3    1
       15          17    9    5    0    1
       16          17    9    5    3    1

(yes, it's like counting in binary, but the low-order bit is always
turned on so that the low-order bits of the polynomial result are
always affected by all nonzero elements of the hash pipeline.
"Skipped" words have a coefficient of zero, which zeroes their effect
on the output of that polynomial, "skipping" the word)

6) These 16 results (call them "superhashes") reflect all phrases up
to length 5 found in the input text. Each is 32 bits long.

7) Each of the .css files is mmapped into virtual memory. The default
size of a .css file is one megabyte plus one byte, and each byte of a
.css file is used as a single 8-bit unsigned integer. Using the length
of the .css file as a modulus, each superhash value maps into a
particular byte of the .css file. Each .css file also has a "score",
initialized to zero.

8) If we're LEARNing, we increment the byte at that superhash index in
the .css file (being careful not to overflow the 8-bit limit, so the
maximum value is actually 255).

9) (pre-Nov-2002 versions): if we're CLASSIFYing, we increment the
per-.css-file score of that .css file by the number found in that
superhash-indexed byte.

(post-Oct-2002 versions): if we're CLASSIFYing, instead of just
incrementing the per-.css-file scores, we (a) normalize the relative
proportions of the .css files with respect to the total number of
features in each .css file, (b) convert the bin values indexed by the
superhash to a probability, (c) "clip" the probability values to
reasonable values (there is no such thing as "certainty" with a finite
sample of an infinite and nonstationary source such as human
language), and (d) update the running probability using the Bayesian
chain-rule formula above.

10) Repeat the above pipeline steps for each "word" in the text.

11) The .css file with the larger score (or probability) at the end
"wins".

There you have it.
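
Here, as a minimal sketch in C (the function and variable names are
mine, not the CRM114 source), is what steps 3 through 6 look like,
using the coefficient table above and the H4-is-newest convention of
step 4:

    #include <stdint.h>

    static const uint32_t coeff[16][5] = {
        /* H4  H3  H2  H1  H0 */
        {  0,  0,  0,  0,  1 }, {  0,  0,  0,  3,  1 },
        {  0,  0,  5,  0,  1 }, {  0,  0,  5,  3,  1 },
        {  0,  9,  0,  0,  1 }, {  0,  9,  0,  3,  1 },
        {  0,  9,  5,  0,  1 }, {  0,  9,  5,  3,  1 },
        { 17,  0,  0,  0,  1 }, { 17,  0,  0,  3,  1 },
        { 17,  0,  5,  0,  1 }, { 17,  0,  5,  3,  1 },
        { 17,  9,  0,  0,  1 }, { 17,  9,  0,  3,  1 },
        { 17,  9,  5,  0,  1 }, { 17,  9,  5,  3,  1 },
    };

    /* push one new word-hash into the pipeline and emit the 16
       superhashes; pipe[4] holds H4 (newest), pipe[0] holds H0 */
    void superhashes(uint32_t pipe[5], uint32_t newhash,
                     uint32_t out[16])
    {
        int i, j;
        for (i = 0; i < 4; i++)       /* everything moves down one */
            pipe[i] = pipe[i + 1];
        pipe[4] = newhash;            /* newest value enters as H4 */
        for (i = 0; i < 16; i++) {
            uint32_t sum = 0;
            for (j = 0; j < 5; j++)   /* coeff column j is H(4-j) */
                sum += coeff[i][j] * pipe[4 - j];
            out[i] = sum;             /* 32-bit wraparound IS the hash */
        }
    }

Each value in out[], taken modulo the .css file length, is the bucket
index used in steps 7 through 9.
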
Previous polynomial sets (using only H0 through H3 of the hash
pipeline, with prime-number coefficients) have reached over 99.87%
accuracy. The best that the 5-stage pipeline has reached for me is
99.984%, and it averages around 99.95% accuracy over months and months
of use.

n.b.: slight error in edge effects - right now, we don't execute the
pipeline polynomial set until the pipeline is full; conversely, we
stop executing the polynomial set when we run out of tokens. This
means that we don't give the first and last few tokens of the email
the full treatment; that's a bug that should be rectified. The other
side of the problem is that filling and flushing the pipe gives worse
results, by putting too much emphasis on the "zero hash" and too much
emphasis on the first and last few words.

n.b.: Arne's Optimization: If the singleton word (H0 alone) doesn't
appear or has a count of 0, then it's useless to check for any further
combinations, as you know they can't appear unless H0 also appeared.
This gives you about a 2x speed improvement.

---More details on the post-Nov-2002 release:---

In releases after Nov 1, 2002, instead of just comparing counts, we do
the true Bayesian chain rule to calculate the probabilities of pass
versus fail. The bounding limits are first to bound within

    [ 1/(featurecount+2) , 1 - 1/(featurecount+2) ]

and then to add further uncertainty to that bound by an additional
factor of 1/(featurecount+1).

We do the chain-rule calculation and then we clip the minimum
probability to MINDOUBLE, which is host-specific but is a VERY small
number (on the order of 10^-300 for Intel boxes). This further
prevents getting the chain rule stuck in a 0.0 / 1.0 state, from which
there is no recovery.

Lastly, because of underflow issues, we quickly lose significance in
the greater of the two probabilities. For example, 1.0 - (10^-30) is
exactly equal to 1.00000 in double-precision arithmetic, yet 10^-30 is
easily achievable in the first ten lines of text. Therefore, we
calculate the chain-rule probabilities twice, using P(S) and P(NS)
separately, and then use the smaller one to recompute the larger one.
Thus, even if there's arithmetic underflow in computing the larger
probability, we still retain the full information in the smaller
probability.
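
A sketch of that double bookkeeping in C (illustrative names, not the
CRM114 source; DBL_MIN plays the role of MINDOUBLE here):

    #include <float.h>

    /* one chain-rule step keeping P(S) and P(NS) separately;
       p_a_s and p_a_ns are the already-bounded local probabilities */
    void chain_rule_both(double *p_s, double *p_ns,
                         double p_a_s, double p_a_ns)
    {
        double num_s  = p_a_s  * (*p_s);
        double num_ns = p_a_ns * (*p_ns);
        double den    = num_s + num_ns;

        *p_s  = num_s  / den;
        *p_ns = num_ns / den;

        /* clip the minimum so neither side ever hits exactly 0.0 */
        if (*p_s  < DBL_MIN) *p_s  = DBL_MIN;
        if (*p_ns < DBL_MIN) *p_ns = DBL_MIN;

        /* the smaller value retains full precision even after the
           larger one has rounded to 1.00000; rebuild the larger one
           from the smaller one */
        if (*p_s > *p_ns) *p_s  = 1.0 - *p_ns;
        else              *p_ns = 1.0 - *p_s;
    }
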
--- Yet More Details - for Post-200310xx Versions ----

During the summer and fall of 2003, I continued experimenting with
improvements to SBPH/BCR as described above. It became clear that
SBPH/BCR was _very_ good, but that it was still operating within the
limits of a linear classifier without hidden layers - i.e., it was a
perceptron (with all of the limitations that perceptron-based
classifiers have). Luckily, the databases in CRM114 are more than
adequate to support a higher-level model than a simple linear
perceptron classifier. I tested a 5th-order Markovian classifier, and
found that it was superior to any other classifier I had tried.

A Markovian classifier operates on the concept that _patterns_ of
words are far more important than individual words. For example, a
Bayesian encountering the phrase "the quick brown fox jumped" would
have five features: "the", "quick", "brown", "fox", and "jumped". A
Sparse Binary Polynomial Hasher would have sixteen features:

    the
    the quick
    the brown
    the quick brown
    the fox
    the quick fox
    the brown fox
    the quick brown fox
    ... and so on.

But each of these features would receive the same weighting in the
Bayesian chain rule above.

The change to become a Markovian is simple - instead of giving each
Sparse Binary Polynomial Hash (SBPH) feature a weight of 1, give each
feature a weight corresponding to how long a Markov chain it matches
in either of the archetype texts. A simple way to do this would be to
make the weight equal to the number of words matched - in this case
the weights would be:

    the                    1
    the quick              2
    the brown              2
    the quick brown        3
    the fox                2
    the quick fox          3
    the brown fox          3
    the quick brown fox    4

and indeed, this gives some improvement over standard SBPH. But there
is room for further improvement. The filter as stated above is still a
linear filter; it cannot learn (or even express!) anything of the
form:

    "A" or "B" but not both

This is a basic limit discovered by Minsky and Papert in 1969 and
published in _Perceptrons_. In this particular case there is a
convenient way to work around this problem.

The solution is to make the weights of the terms "superincreasing",
such that long Markov-chain features have so high a weight that
shorter chains are completely overruled. For example, if we wanted to
do "A or B but not both" in such a superincreasing filter, the
weights:

    "A"    at  1
    "B"    at  1
    "A B"  at -4

will give the desired results. For convenience in calculation, CRM114
uses the superincreasing weights defined by the series 2^(2n) (with
n = 0 for a single word) - that is:

    the                     1
    the quick               4
    the brown               4
    the quick brown        16
    the fox                 4
    the quick fox          16
    the brown fox          16
    the quick brown fox    64

Note that with these weights, a chain of length N can override all
chains of length N-1, N-2, N-3... and so on. This is particularly
satisfying, because the standard .css files already contain all of the
information needed to do this more advanced calculation. The file
format is not only compatible, it is _identical_, so users don't have
to restart their training.

This Markovian matching gives a considerable increase in accuracy over
SBPH matching, and almost a factor of 2 improvement over Bayesian
matching. It is now the default matching system in CRM114 as of
version 200310xx.
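
In code form, the superincreasing weighting is a one-liner. This is a
sketch under the 2^(2n) convention above, with n = nwords - 1; it is
not the CRM114 source, which folds the weight into the chain-rule
local probability rather than using a standalone function:

    /* superincreasing Markovian weight for an SBPH feature;
       nwords is the number of words actually present in the
       feature (1..5), giving weights 1, 4, 16, 64, 256 */
    static unsigned long markov_weight(int nwords)
    {
        return 1ul << (2 * (nwords - 1));
    }
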
--------------------------------------------------------

The OSB Markovian classifier

OSB (Orthogonal Sparse Bigram) is a simplification of SBPH inspired by
Fidelis Assis. The change is to _omit_ all of the word combinations
that don't have exactly two word tokens in them. This method has fewer
features, but is often as good as or even better than Markovian in
accuracy. Because it has fewer features, it needs less space in the
.css files for equal accuracy; because it generates fewer features, it
also runs considerably faster than Markovian.

Other than that, it's pretty similar. It's sufficiently similar that
OSB and Markovian can even use each other's .css files (with some
decrease in accuracy). It's not recommended, but it works.

---------------------------------------------------------------

The Winnow classifier

Winnow is a different way of classifying; it doesn't generate
probabilities, but rather weights. The version in CRM114 at this
particular time uses the OSB feature set. Christian Siefkes, Shalendra
Chhabra, and Laird Bryer did the first hacking on this; then, with
Fidelis Assis' OSB feature generator, it really took off.

Here's a quick synopsis of the algorithm (a code sketch of the learn
step appears at the end of this section):

1) Every possible feature, from AAAA to ZZZZZZZ, starts with a weight
of 1.000000 (note: we only record weights that _aren't_ 1.000000, so
if we don't find a feature in our feature list, we can assume it has a
value of 1.000000).

2) To learn, we do these steps in order:
 - generate all of the OSB features in the example text
 - delete all duplicate features
 - if the example is an example "in the class", multiply every found
   feature's weight by the "Promotion Constant", which is empirically
   set at 1.23
 - if the example is a text that is NOT supposed to be "in the class",
   multiply each found feature's weight by the "Demotion Constant",
   which is empirically set at 0.83
 (note that no matter how many times a feature appears in a particular
 example, it only gets promoted or demoted ONCE).

3) To classify, we do these steps in order:
 - generate all of the OSB features in the unknown text
 - delete all duplicate features
 - add up all of the weights of these features in each of the
   statistics files (don't forget that any feature that doesn't exist
   in the stats file gets a default value of 1.000000!)
 - the score of each file is the total weight accumulated by the
   per-feature lookups, divided by the total number of features (note
   that since not-seen-before features score 1.000000, a totally
   inconclusive score is Nfeatures/Nfeatures = 1.000000)
 - the file with the highest score wins.

Winnow works best when you add a "thickness factor" correction, where
you train not just on error, but rather in this less subtle way:

 - If the _correct_ class didn't score at least "thickness" above the
   decision threshold (in pR, the decision threshold is 0.0), then
   train the _correct_ class with the example text in correct
   (promotion) mode.
 - If the _incorrect_ class didn't score at least "thickness" below
   the decision threshold (again, in pR units), then train the
   incorrect class in error (demotion) mode. This is done with the
   <refute> flag.

Winnow is a well-known classification algorithm in pattern
recognition; the current implementation will probably be upgraded and
debugged in newer releases.
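
Here is the promised sketch of the learn step, in C. weight_lookup is
a hypothetical accessor standing in for CRM114's actual .cow
hash-table code, and the constants are the empirical values given
above; this is an illustration, not the CRM114 source:

    #include <stdint.h>

    #define PROMOTE 1.23
    #define DEMOTE  0.83

    /* hypothetical accessor into the .cow file: returns a pointer to
       the stored weight for this feature hash, creating the slot
       (initialized to 1.0) if it doesn't exist yet */
    extern double *weight_lookup(uint64_t feature_hash);

    /* one Winnow training pass: feats[] holds the de-duplicated OSB
       feature hashes of the example text */
    void winnow_train(const uint64_t *feats, int nfeats, int in_class)
    {
        int i;
        for (i = 0; i < nfeats; i++) {
            double *w = weight_lookup(feats[i]);
            *w *= in_class ? PROMOTE : DEMOTE;  /* once per feature */
        }
    }

Classification then just averages the looked-up weights over all
features of the unknown text, as in step 3.
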
----------------------------------------------------------------

The Correlator classifier

The correlator classifier is different! The correlator classifier
slides the window of the unknown text across the known texts, and
counts places where the same character appears in both... well,
actually, it counts the sum of the squares of the run-lengths of the
matching strings, reiterated at each point in the string. If the
letters don't match, nothing is counted.

So, "The quick brown fox jumped over the lazy dog's back 0123456789"
matched against "cat" will get just three points - one for the c in
"back" matching the c in "cat", and two for the a in "cat" matching
the a's in "lazy" and "back". (Note that the T in "The" doesn't match
the t in "cat", because they're different cases.) However, "lawn fog"
will match the five-character sequence "wn fo", giving
1 + 4 + 9 + 16 + 25 = 55 points. Note that EVERY POSSIBLE substring in
the unknown text is compared against the known texts.

This is Markovian with a major vengeance streak (or death wish, if you
don't have lots of CPU and CPU cooling to spare; a > 100x slowdown is
entirely possible with this correlation classifier, so consider
yourself warned).

The databases of correlator classifiers are NOT compatible with the
.css files of SBPH, OSB, Markovian, and Winnow classification. Don't
even think of intermixing them.

--------------------------------------------------------------

Each slot of an SBPH or OSB Markovian .css file (and, for Winnow, a
.cow file) holds a 64-bit hash of a feature (whether the feature is a
single word, a bigram, or a full SBPH phrase does not matter) and a
32-bit representation of the value. In .css files, the 32 bits is an
unsigned integer showing the number of occurrences of this particular
feature in the training set; in .cow files it's a 32-bit
floating-point weight: greater than 1.000 means "preponderance of
evidence in favor", less than 1.000 means "preponderance of evidence
against", and thus a value of 1.000 exactly means "no information"
(and in the case of a .cow file, like a .css file, a value of 0.000
exactly, with all 64 bits of hash == 0, means "unused slot").

For fast access, the first 32 bits of the hash (called h1 in the code)
are used as an index (modulo the length of the .css/.cow file), and
that's the preferred slot location to put this data. If that slot is
already in use, the next slot is used. If that is already in use, the
_next next_ slot is used... and so on. This "next slot not yet used"
sequence is an overflow chain.

For best performance, you want to keep the chains short. But keeping
chains short means keeping occupancy low, and that wastes a lot of
space; 90-95 percent utilization is a good compromise. Note that the
time to find a slot (or to find that it's not there) goes with the
length of the overflow chains, so long chains are _very_ bad for
performance. I usually set a limit of 256 or even 128 on chains. Once
you go past that limit, you need to start expiring old data out of the
chain.

You can do that by zapping out low-valued (not very significant)
slots, but that means old, stale, but originally high-valued slots
never expire. Another method would be to use an LRS (Least Recently
Seen) tracking system, but that would use up a lot more disk space for
the .css/.cow files - almost doubling it is the best estimate I have.

"Microgrooming" adds a random factor. A feature is microgroomed if its
hash is equal to a pseudorandom number - and the microgrooming is
merely a _lessening_ of the significance. If the significance of a
slot drops below "saw it once", the slot is reclaimed for reuse.

Note also that we don't groom the whole .css/.cow file. We groom
_only_ the chain that we noticed was too long. This minimizes how much
data we lose in a microgroom (face it, database grooming/expiring is
brain surgery with a butter knife... microgrooming is just using the
serrations on the edge to minimize how much we scrape away). Note that
this works pretty well - most of the slots ever used in a .css/.cow
file contain only a single occurrence, so reclaiming a small fraction
of them (currently 1 in 16, scattered randomly) is a good compromise.
It also will eventually expire out even the largest feature, if that
feature is not ever retrained.

And the killer bug? Well, consider how we know we've reached
end-of-chain: we see a zeroed slot. Microgrooming puts in a number of
zeroed slots - each of which is seen as a chain terminator. BUT - when
we microgroom, we need to re-check the locations of each slot's worth
of data, to make sure it's findable - that is, that it isn't separated
from its optimal location by a freshly zeroed slot (which would
indicate end-of-chain). This is "repacking" the chain. And the code
that did it had a bug that repacked only the first part of the chain
and then stopped. This meant that the tail of the chain (avg 50% or
so) could NOT be found - the data there was lost! This bug has now
been (hopefully) stomped.

-Bill Yerazunis