19 Aug 2004 crm114 20040816.BlameClockworkOrange-auto.3
crm114 - The Controllable Regex Mutilator
crm [-d N] [-e] [-h] [-p] [-P N] [-q m] [-s N] [-S N] [-t] [-T] [-u dir] [-v] [-w N] [--] [--foo] [--x=y] [-{ stmts }] crmfile
-d N
    enter debugger after running N cycles. Omitting N means N equals 0.
-e
    do not import any environment variables
-h
    print help text
-p
    generate an execution-time-spent profile on exit
-P N
    max program lines
-q m
    mathmode (0,1 = alg/RPN only in EVAL, 2,3 = alg/RPN everywhere)
-s N
    new feature file (.css) size is N (default 1 meg+1 featureslots)
-S N
    new feature file (.css) size is N rounded to 2^I+1 featureslots
-t
    user trace output
-T
    implementors trace output (only for the masochistic!)
-u dir
    chdir to directory dir before starting execution
-v
    print CRM114 version identification and exit
-w N
    max data window (bytes, default 16 megs)
--
    signals the end of CRM114 flags; prior flags are not seen by the user program; subsequent args are not processed by CRM114
--foo
    creates the user variable :foo: with the value SET
--x=y
    creates the user variable :x: with the value y
-{ stmts }
    execute the statements inside the {} brackets
crmfile
    .crm file name
CRM114's unique strengths are the data structure (everything is a string and a string can overlap another string), its ability to work on truly infinitely long input streams, its ability to use extremely advanced classifiers to sort text, and the ability to do approximate regular expressions (that is, regexes that don't quite match) via the TRE regex library.
CRM114 also sports a very powerful subprocess control facility, and a unique syntax and program structure that puts the fun back in programming (OK, you can run away screaming now). The syntax is declensional rather than positional; the type of quote marks around an argument determines what that argument will be used for.
The typical CRM114 program uses regex operations more often than addition (in fact, math was only added to TRE in the waning days of 2003, well after CRM114 had been in daily use for over a year and a half).
In other words, crm114 is a very very powerful mutagenic filter that happens to be a programming language as well.
The filtering style of the CRM-114 discriminator is based on the fact that most spam, normal log file messages, or other uninteresting data is easily categorized by a few characteristic patterns (such as "Mortgage leads", "advertise on the internet", and "mail-order toner cartridges"). CRM114 may also be useful to folks who are on multiple interlocking mailing lists.
In a bow to Unix-style flexibility, by default CRM114 reads its input from standard input, and by default sends its output to standard output. Note that the default action has a zero-length output. Redirection and use of other input or output files is possible, as well as the use of windowing, either delimiter-based or time-based, for real-time continuous applications.
CRM114 can be used for other than mail filtering; consider it to be a version of grep with super powers. If perl is a seventy-bladed swiss army knife, CRM114 is a razor-sharp katana that can talk.
CRM114 can be directly invoked by the shell if the first line of your program file uses the shell standard, as in:
#! /usr/bin/crm
You can use CRM114 flags on the shell-standard invocation line, and hide them with '--' from the program itself; '--' incidentally prevents the invoking user from changing any CRM114 invocation flags.
Flags should be located after any positional variables on the command line. Flags are visible as :_argN: variables, so you can create your own flags for your own programs (separate CRM114 and user flags with '--'). Two examples on how to do this:
./foo.crm bar mugga < baz -t -w 150000
./foo.crm -t -w 1500000 -- bar < baz mugga
One example on how not to do this:
./foo.crm -t -w 150000 bar < baz mugga
(That's WRONG!)
You can put a list of user-settable vars on the #!/usr/bin/crm invocation line. CRM114 will print these out when a program is invoked directly (e.g. "./myprog.crm -h", not "crm myprog.crm -h") with the -h (for help) flag. (Note that this works ONLY with bash on Linux; the *BSDs have a different bash interpretation and this doesn't work there.)
Example:
#!/usr/bin/crm -( var1 var2=A var2=B var2=C )
This allows only var1 and var2 to be set on the command line. If a variable is not assigned a value, the user can set any value desired. If the variable is equated to a set of values, those are the only values allowed.
Another example:
#!/usr/bin/crm -( var1 var2=foo ) --
This allows var1 to be set to any value, var2 may only be set to either foo or not at all, and no other variables may be set nor may invocation flags be changed (because of the trailing "--"). Since "--" also blocks '-h' for help, such programs should provide their own help facility.
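The rules above for user variables (--foo sets :foo: to SET, --x=y sets :x: to y, and a "-( ... )" list restricts what may be set) can be modelled in a few lines of Python. This is only a sketch of the described behaviour, not CRM114's own argument parser; the function names and the spec dictionary are invented for illustration.

```python
def parse_user_vars(args):
    """Collect --foo and --x=y style arguments into a dict of user variables."""
    user_vars = {}
    for arg in args:
        if not arg.startswith("--") or arg == "--":
            continue  # not a user variable (positional arg, CRM114 flag, or bare --)
        name, sep, value = arg[2:].partition("=")
        user_vars[name] = value if sep else "SET"  # --foo alone gets the value SET
    return user_vars


def check_allowed(user_vars, spec):
    """spec maps a name to a set of permitted values, or None for 'any value'."""
    for name, value in user_vars.items():
        if name not in spec:
            return False  # variable not listed on the invocation line
        allowed = spec[name]
        if allowed is not None and value not in allowed:
            return False  # value outside the listed set
    return True


# Hypothetical spec mirroring "-( var1 var2=A var2=B var2=C )"
spec = {"var1": None, "var2": {"A", "B", "C"}}
```

Under these assumptions, "--var1=anything --var2=B" passes the check, while "--var2=D" and "--var3=1" do not.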
Examples: :here:, :ThErE:, :every-where_0123+45%6789:, :this_is_a_very_very_long_var_name_that_does_not_tell_us_much:. Builtin variables:
:_nl:
    newline
:_ht:
    horizontal tab
:_bs:
    backspace
:_sl:
    a slash
:_sc:
    a semicolon
:_arg0: thru :_argN:
    command-line args, including all flags
:_argc:
    how many command line arguments there were
:_pos0: thru :_posN:
    positional args ('-' or '--' args deleted)
:_posc:
    how many positional arguments there were
:_pos_str:
    all positional arguments concatenated
:_env_whatever:
    environment value 'whatever'
:_env_string:
    all environmental arguments concatenated
:_crm_version:
    the version of the CRM system
:_dw:
    the current data window contents
You can also use the standard constant C '\' characters, such as "\n" for newline, as well as escaped hexadecimal and octal characters like \xHH and \oOOO, but these are constants, not variables, and cannot be redefined.
Depending on the value of "math mode" (flag -q), you can also use :#:string_or_var: to get the length of a string, and :@:string_or_var: to do basic mathematics and inequality testing, either only in EVALs or for all var-expanded expressions. See "Sequence of Evaluation" below for more details.
Variables don't get their own storage unless you ISOLATE them (see below); instead, variables are start/length pairs indexing into the default data window. Thus, ALTERing an unISOLATEd variable changes the value of the default data buffer itself. This is a great power, so use it only for good, and never for evil.
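To make the start/length idea concrete, here is a small Python model of a data window whose variables are only index pairs. It is a simplified sketch, not CRM114's implementation: the class and method names are invented, and it does not re-adjust other variables that overlap or follow the altered span.

```python
class DataWindow:
    """Toy model: variables are (start, length) pairs into one shared string."""

    def __init__(self, text):
        self.text = text
        self.vars = {}  # name -> (start, length)

    def bind(self, name, start, length):
        self.vars[name] = (start, length)

    def value(self, name):
        start, length = self.vars[name]
        return self.text[start:start + length]

    def alter(self, name, new_value):
        # Changing the variable rewrites the shared window itself.
        start, length = self.vars[name]
        self.text = self.text[:start] + new_value + self.text[start + length:]
        self.vars[name] = (start, len(new_value))


dw = DataWindow("Hello, world")
dw.bind("who", 7, 5)          # "who" covers the text "world"
dw.alter("who", "CRM114")     # the window now reads "Hello, CRM114"
```

The point of the sketch is that alter() has no separate storage to write to; the only place the new value can go is the window.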
\
    '\' is the string-text escape character. You only need to escape the literal representation of closing delimiters inside var-expanded arguments.
    You can use the classic C/C++ \-escapes, such as \n, \r, \t, \a, \b, \v, \f, \0, and also \xHH and \oOOO for hex and octal characters, respectively. A '\' as the last character of a line means the next line is just a continuation of this one. A \-escape that isn't recognized as something special isn't an error; you may optionally escape any of the delimiters >, ) ] } ; / # \ and get just that character. A '\' anywhere else is just a literal backslash, so the regex ([abc])\1 is written just that way; there is no need to double-backslash the \1 (although it will work if you do).
# this is a comment
# and this too \#
    A comment is not a piece of preprocessor sugar -- it is a statement and ends at the newline or at "\#".
insert filename
    inserts the file verbatim at this line at compile time.
;
    statement separator - must ALWAYS be escaped as \; unless it's inside delimiters or else it will mark the end of the statement.
{ and }
    start and end blocks of statements. Must always be '\' escaped or inside delimiters or these will mark the start/end of a block.
noop
    no-op statement
:label:
    define a GOTOable label
accept
    writes the current data window to standard output; execution continues.
alius
    if the last bracket-group succeeded, ALIUS skips to end of {} block (a skip, not a FAIL); if the prior group FAILed, ALIUS does nothing. Thus, ALIUS is both an ELSE clause and a CASE statement.
alter (:var:) /new-val/
    destructively change value of var to new-val; (:var:) is var to change (var-expanded); /new-val/ is value to change to (var-expanded).
classify <flags> (:c1:...|...:cN:) (:stats:) [:in:] /word-pat/
    compare the statistics of the current data window buffer with classfiles c1...cN.
eval (:result:) /instring/
    repeatedly evaluates /instring/ until it ceases to change, then places that result as the value of :result: . EVAL uses smart (but foolable) heuristics to avoid infinite loops, like evaluating a string that evaluates to a request to evaluate itself again. The error rate is about 1 / 2^62 and will detect chain groups of length 255 or less.
    If the instring uses math evaluation (see section below on math operations) and the evaluation has an inequality test, (>, < or =) then if the inequality fails, the EVAL will FAIL to the end of block. If the evaluation has a numeric fault (e.g. divide-by-zero) the EVAL will do a TRAPpable FAULT.
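The "evaluate until it ceases to change" behaviour of EVAL is a fixed-point iteration. The Python sketch below models only :*:var: substitution and detects loops with a set of previously seen strings; that loop check is a stand-in assumption, not the heuristic CRM114 actually uses, and treating an unknown variable as an empty string is likewise an assumption.

```python
import re

VAR_REF = re.compile(r":\*:([^:]+):")


def expand_once(text, variables):
    """Replace every :*:name: with the variable's value (unknown names become '')."""
    return VAR_REF.sub(lambda m: variables.get(m.group(1), ""), text)


def eval_fixed_point(text, variables, max_rounds=255):
    """Expand repeatedly until the string stops changing."""
    seen = {text}
    for _ in range(max_rounds):
        expanded = expand_once(text, variables)
        if expanded == text:
            return text  # nothing left to expand
        if expanded in seen:
            raise RuntimeError("evaluation loop detected")
        seen.add(expanded)
        text = expanded
    raise RuntimeError("gave up after max_rounds expansions")


variables = {"a": ":*:b:", "b": "done"}
result = eval_fixed_point(":*:a:", variables)
```

Here :*:a: expands to :*:b:, which expands to "done", which no longer changes. A pair of variables that refer to each other raises instead of looping forever.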
exit /:retval:/
    ends program execution. If supplied, the return value is converted to an integer and returned as the exit code of the crm114 program. If no retval is supplied, the return value is 0.
fail
    skips down to end of the current { } block and causes that block to exit with a FAIL status (see ALIUS for why this is useful)
fault /faultstr/
    forces a FAULT with the given string as the reason. The fault string is val-expanded.
goto /:label:/
    unconditional branch (you can use a variable as the goal, e.g. /:*:there:/ )
hash (:result:) /input/
    compute a fast 32-bit hash of the /input/, and ALTER :result: to the hexadecimal hash value. HASH is not warranted to be constant across major releases of CRM114, nor is it cryptographically secure.
intersect (:out:) [:var1: :var2: ...]
    makes :out: contain the part of the data window that is the intersection of :var1: :var2: ... ISOLATEd vars are ignored. This only resets the value of the captured variable, and does NOT alter any text in the data window.
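Because variables are start/length pairs, INTERSECT amounts to interval arithmetic on those pairs. The Python sketch below shows that arithmetic only; the function name is invented, and returning a zero-length span when the inputs do not overlap is an assumption, since the text above does not say what CRM114 does in that case.

```python
def intersect(spans):
    """spans: list of (start, length) pairs; returns the (start, length) they share."""
    start = max(s for s, _ in spans)
    end = min(s + n for s, n in spans)
    return (start, max(0, end - start))


window = "the quick brown fox"
quick_brown = (4, 11)   # covers "quick brown"
brown_fox = (10, 9)     # covers "brown fox"
out_start, out_len = intersect([quick_brown, brown_fox])
shared = window[out_start:out_start + out_len]
```

Only the pair stored for the output variable changes; the window string itself is never touched, which matches the "does NOT alter any text" note above.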
isolate (:var:) /initial-value/
    puts :var: into a data area outside of the data buffer; subsequent changes to this var don't change the data buffer (though they may change the value of any var subsequently set inside of this var). If the var already was ISOLATED, this is a noop.
input <flags> (:result:) [:filename:]
    read in the content of filename. If no filename, then read stdin
learn <flags> (:class:) [:in:] /word-pat/
    learn the statistics of the :in: var (or the input window if no var) as an example of class :class:
liaf
    skips UP to START of the current {} block (LIAF is FAIL spelled backwards)
match <flags> (:var1: ...) [:in:] /regex/
    Attempt to match the given regex; if match succeeds, variables are bound; if match fails, program skips to the closing '}' of this block
output <flags> [filename] /output-text/
    output an arbitrary string with captured values expanded.
syscall <flags> (:in:) (:out:) (:status:) /command/
    execute a shell command
trap (:reason:) /trap_regex/
    traps faults from both FAULT statements and program errors occurring anywhere in the preceding bracket-block. If no fault exists, TRAP does a SKIP to end of block. If there is a fault and the fault reason string matches the trap_regex, the fault is trapped, and execution continues with the line after the TRAP, otherwise the fault is passed up to the next surrounding trapped bracket block.
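Readers who know exception handling may find TRAP easier to picture as a handler that filters on the fault's reason string. The Python sketch below is an analogy under that assumption, not a description of CRM114 internals; Fault and run_with_trap are invented names.

```python
import re


class Fault(Exception):
    """Stands in for a CRM114 fault; the message is the fault reason string."""


def run_with_trap(block, trap_regex):
    """Run block(); return the reason if trap_regex matches it, else pass it up."""
    try:
        block()
    except Fault as fault:
        reason = str(fault)
        if re.search(trap_regex, reason):
            return reason  # trapped: execution would continue after the TRAP
        raise              # not ours: hand it to the next surrounding handler
    return None            # no fault at all


def failing_block():
    raise Fault("disk full while writing")


trapped = run_with_trap(failing_block, r"disk full")
```

Calling run_with_trap(failing_block, r"network") lets the Fault propagate, which mirrors a fault being passed up to the next surrounding trapped block.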
union (:out:) [:var1: :var2: ...]
    makes :out: contain the union of the data window segments that contains var1, var2... plus any intervening text as well. Any ISOLATEd var is ignored. This is non-surgical, and does not alter the data window
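In start/length terms, UNION as described above is the smallest single span that covers every input span, which is why intervening text comes along. A minimal Python sketch of that arithmetic follows; the function name is invented and this is not CRM114 source.

```python
def union(spans):
    """spans: list of (start, length) pairs; returns one covering (start, length)."""
    start = min(s for s, _ in spans)
    end = max(s + n for s, n in spans)
    return (start, end - start)


window = "the quick brown fox"
quick = (4, 5)   # covers "quick"
fox = (16, 3)    # covers "fox"
out_start, out_len = union([quick, fox])
covered = window[out_start:out_start + out_len]
```

The word "brown" belongs to neither input, but it sits between them and so ends up inside the result.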
window <flags> (:w-var:) (:s-var:) /cut-regex/ /add-regex/
    window slider. This deletes up to and including the cut-regex from :var: (default: use the data window), then reads and adds text from standard input up to and including the add-regex.
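The two halves of WINDOW (cut from the front, then add from input) can be sketched in Python as below. Several details are assumptions made only for the sketch: an unmatched cut-regex leaves the window alone, the add-regex is tested against the newly read text only, and input is read one character at a time. The function name is invented.

```python
import io
import re


def slide_window(window, stream, cut_regex, add_regex):
    """Drop text through the first cut_regex match, then append input through add_regex."""
    cut = re.search(cut_regex, window)
    if cut:
        window = window[cut.end():]   # delete up to and including the cut match
    added = ""
    while True:
        ch = stream.read(1)
        if ch == "":
            break                     # end of input
        added += ch
        if re.search(add_regex, added):
            break                     # add match is included in what we keep
    return window + added


source = io.StringIO("line3\nline4\n")
new_window = slide_window("line1\nline2\n", source, r"\n", r"\n")
```

With newline as both regexes this behaves like a two-line sliding window: the oldest line falls off the front and the next input line is appended.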
Matches are, by default "first starting point that matches, then longest match possible that can fit".
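This rule differs from engines that take the first alternative that works. The Python sketch below brute-forces the rule as stated (earliest start, then longest end) on top of Python's own re module, purely to illustrate the difference; it is not how TRE or GNU regex search, and the function name is invented.

```python
import re


def leftmost_longest(pattern, text):
    """Return (start, matched_text) for the earliest start and the longest match there."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):            # earliest starting point first
        for end in range(len(text), start - 1, -1):  # then the longest candidate first
            if compiled.fullmatch(text, start, end):
                return (start, text[start:end])
    return None


longest = leftmost_longest(r"a|ab", "xabc")
first_alternative = re.search(r"a|ab", "xabc").group()
```

For the pattern a|ab against "xabc", both approaches start at the same place, but the rule above yields "ab" where Python's default search stops at "a".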
a through z
A through Z
0 through 9
    all match themselves.
most punctuation
    matches itself, but check below!
*
    repeat preceding 0 or more times
+
    repeat preceding 1 or more times
?
    repeat preceding 0 or 1 time
*?, +?, ??
    repeat preceding, but shortest match that fits, given the already-selected start point of the regex. (only supported by TRE regex, not GNU regex)
[abcde]
    any one of the letters a, b, c, d, or e
[a-q]
    the letters a through q (just one of them)
{n,m}
    repetition count: match the preceding at least n and no more than m times (POSIX restricts this to a maximum of 255 repeats)
[[:<:]]
    matches at the start of a word (GNU regex only)
[[:>:]]
    matches the end of a word (GNU regex only)
^
    as first char of a match, matches the start of a line (ONLY in <nomultiline> matches)
$
    as last char of a match, matches at the end of a line (ONLY in <nomultiline> matches)
.
    (a period) matches any single character (except start-of-line or end of line "virtual characters", but it does match a newline).
a|b
    match a or b
(match)
    the () go away, and the string that matched inside is available for capturing. Use \\( and \\) to match actual parentheses (the first '\' tells "show the second '\' to the regex engine", the second '\' forces a literalization onto the parenthesis character).
\n
    matches the N'th parenthesized subexpression. Remember to backslash-escape the backslash (e.g. write this as \\1). This is only if you're using TRE, not GNU regex.
[[:alnum:]] [[:alpha:]] [[:blank:]] [[:cntrl:]] [[:digit:]] [[:lower:]] [[:upper:]] [[:graph:]] [[:print:]] [[:punct:]] [[:space:]] [[:xdigit:]]
[[:graph:]] matches any character that puts ink on paper or lights a pixel. [[:print:]] matches any character that moves the "print head" or cursor.
1
    \-constants like \n, \o377 and \x3F are substituted
2
    :*:var: variables are substituted (note the difference between a constant like '\n' and a variable like ":*:_nl:" here - constants are substituted first, then variables are substituted.)
3
    :#:var: string-length operations are performed
4
    :@:expression: mathematical expressions are performed; syntax is either RPN or non-precedenced (parens required) algebraic notation. Embedded non-evaluated strings in a mathematical expression are currently a no-no.
    Allowed operators are: + - * / % > < = only. Only >, <, and = set logical results; they also evaluate to 1 and 0 for continued chain operations - e.g. ((:*:a: > 3) + (:*:b: > 5) + (:*:c: > 9) > 2) is true IFF any of the following is true
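The "evaluate to 1 and 0 for continued chain operations" convention means a comparison can be used as an ordinary number in the next operation. The Python sketch below shows that convention with plain integer arithmetic; it is not CRM114's math evaluator, and gt and chain are invented helper names.

```python
def gt(x, y):
    """A comparison that yields 1 or 0 so it can feed further arithmetic."""
    return 1 if x > y else 0


def chain(a, b, c):
    """Mirror of ((a > 3) + (b > 5) + (c > 9) > 2) using 1/0 results."""
    total = gt(a, 3) + gt(b, 5) + gt(c, 9)
    return gt(total, 2)


all_hold = chain(4, 6, 10)
one_fails = chain(4, 6, 9)
```

With a=4, b=6, c=10 each inner comparison yields 1, the sum is 3, and the outer test 3 > 2 yields 1. With c=9 the sum is 2 and the outer test yields 0.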
Approximate matching is specified similarly to a "repetition count" in a regular regex, using brackets. This approximation applies to the previous parenthesized expression (again, just like repetition counts). You can specify maximum total changes, and how many inserts, deletes, and substitutions you wish to allow. The minimum-error match is found and reported, if it exists within the bounds you state.
The basic syntax is:
(text-to-match){~[maxerrs] [#maxsubsts] [+maxinserts] [-maxdeletes]}
Note that the '~' (with an optional maxerr count) is required (that's how we know it's an approximate regex rather than just a rep-count). If you don't specify a max error count, you will get the best match; if you do, the match will have at most that many errors.
Remember that you specify the changes to the text in the pattern necessary to make it match the text in the string being searched.
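To get a feel for what "minimum-error match" means, the Python sketch below computes the fewest single-character inserts, deletes, and substitutions needed to turn the pattern into some substring of the searched text. It uses the textbook dynamic-programming approach with every edit costing 1; this is an illustration only and is not claimed to be how TRE searches. The function name is invented.

```python
def best_approx_distance(pattern, text):
    """Fewest edits to the pattern that make it equal some substring of text."""
    # prev[j]: fewest edits to match the pattern prefix so far against a
    # substring of text ending at position j. Row 0 is all zeros because a
    # match may begin anywhere in the text.
    prev = [0] * (len(text) + 1)
    for i in range(1, len(pattern) + 1):
        cur = [i] + [0] * len(text)
        for j in range(1, len(text) + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # same char, or one substitution
                         prev[j] + 1,         # pattern char with no counterpart
                         cur[j - 1] + 1)      # text char with no counterpart
        prev = cur
    return min(prev)


exact = best_approx_distance("foobar", "xx foobar yy")
one_off = best_approx_distance("foobar", "a fobar b")
```

Under this model a bound such as {~3} would accept any text whose best distance is 3 or less.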
You cannot use approximate regexes and backrefs (like \1) in the same regex. This is a limitation of TRE at this point.
You can also use an inequality in addition to the basic syntax above:
(text-to-match){~[maxerrs] [basic-syntax] [nI + mD + oS < K] }
where n, m, and o are the costs per insertion, deletion, and substitution respectively, 'I', 'D', and 'S' are indicators to tell which cost goes with which kind of error, and K is the total cost of the errors; the cost of the errors is always strictly less than K. Here are some examples.
(foobar)
    exactly matches "foobar"
(foobar){~}
    finds the closest match to "foobar", with the minimum number of inserts, deletes, and substitutions. Always succeeds.
(foobar){~3}
    finds the closest match to "foobar", with no more than 3 inserts, deletes, or substitutions
(foobar){~2 +2 -1 #1}
    find the closest match to "foobar", with at most two errors total, and at most two inserts, one delete, and one substitution.
(foobar){~4 #1 1i + 2d < 5 }
    find the closest match to "foobar", with at most four errors total, at most one substitution, and with the number of insertions plus 2x the number of deletions less than 5.
(foo){~1}(bar){~1}
    find the closest match to "foobar", with at most one error in the "foo" and one error in the "bar".
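The bounds in these examples combine simple counts with a weighted inequality. The Python sketch below checks a candidate's insert, delete, and substitution counts against such bounds; it models only the arithmetic described above, and the function and parameter names are invented.

```python
def within_bounds(inserts, deletes, substs, max_errs=None, max_inserts=None,
                  max_deletes=None, max_substs=None,
                  costs=(0, 0, 0), cost_limit=None):
    """True if the error counts satisfy every bound that was given."""
    if max_errs is not None and inserts + deletes + substs > max_errs:
        return False
    if max_inserts is not None and inserts > max_inserts:
        return False
    if max_deletes is not None and deletes > max_deletes:
        return False
    if max_substs is not None and substs > max_substs:
        return False
    if cost_limit is not None:
        per_insert, per_delete, per_subst = costs
        weighted = per_insert * inserts + per_delete * deletes + per_subst * substs
        if weighted >= cost_limit:   # the weighted cost must be strictly less than K
            return False
    return True


# Bounds from the example {~4 #1 1i + 2d < 5}
bounds = dict(max_errs=4, max_substs=1, costs=(1, 2, 0), cost_limit=5)
accepted = within_bounds(1, 1, 1, **bounds)   # weighted cost 1 + 2 = 3
rejected = within_bounds(1, 2, 0, **bounds)   # weighted cost 1 + 4 = 5, not < 5
```

The second case stays within the total-error limit but is rejected by the inequality because 5 is not strictly less than 5.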
Unlike most computer languages, CRM114 uses inflection (or declension) rather than position to describe what role each part of a statement plays. The declensions are marked by the delimiters: the /, ( and ), < and >, and [ and ].
By and large, you can mix up the arguments to each kind of statement without changing their meaning. Only the ACTION needs to be first. Other parts of the statement can occur in any order, save that multiple (paren_args) and /pattern_args/ must stay in their nominal order but can go anywhere in the statement. They do not need to be consecutive.
The parts of a CRM114 statement are:
ACTION
    the verb. This is at the start of the statement.
/pattern/
    the overall pattern the verb should use, analogous to the "subject" of the statement.
<flags>
    modifies how the ACTION does the work. You'd call these "adverbs" in human languages.
(vars)
    what variables to use as adjuncts in the action (what would be called the "direct objects"). These can get changed when the action happens.
[limited-to]
    where the action is allowed to take place (think of it as the "indirect object"). These are not directly changed by the action.
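Since the delimiters alone identify each part, a statement can be taken apart without caring about argument order. The Python sketch below shows the idea with a deliberately naive splitter: it ignores backslash escapes, nested delimiters, and multi-line statements, and it is not CRM114's parser. The names PART and parse_statement are invented.

```python
import re

# One alternative per delimiter pair; the group name records the role.
PART = re.compile(
    r"/(?P<pattern>[^/]*)/"
    r"|<(?P<flags>[^>]*)>"
    r"|\((?P<vars>[^)]*)\)"
    r"|\[(?P<limited>[^\]]*)\]"
)


def parse_statement(stmt):
    """Split a statement into its ACTION and its delimiter-marked parts."""
    action, _, rest = stmt.strip().partition(" ")
    parts = {"action": action, "pattern": [], "flags": [], "vars": [], "limited": []}
    for m in PART.finditer(rest):
        parts[m.lastgroup].append(m.group(m.lastgroup).strip())
    return parts


one = parse_statement("match <nomultiline> (:who:) [:in:] /hello/")
two = parse_statement("match /hello/ [:in:] <nomultiline> (:who:)")
```

Both orderings produce the same dictionary, which is the practical meaning of "declensional rather than positional". Lists are used for each role so that multiple (paren_args) or /pattern_args/ keep their relative order.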
The CRM114 homepage is at http://crm114.sf.net/ .
This manpage describes the crm114 utility as it has been described by QUICKREF.txt, shipped with crm114-20040212-BlameJetlag.src.tar.gz. The DESCRIPTION section is copy-and-pasted from INTRO.txt as distributed with the same source tarball.
Converted from plain ascii to zoem by Joost van Baal.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program (see COPYING); if not, check with http://www.gnu.org/copyleft/gpl.html or write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111, USA.