Things To Do in CRM114 (rough priority list) 1) redirect OUTPUT (DONE!) "output (filename) /interp_val/ 2) write WINDOW (timeout) /dropregex/ /readregex/ drops to end of dropregex, then reads in till readregex or timeout. (DONE! Well, at least done enough to parse a syslog.) (DONE!) 6) get MATCH to bridge end-of-line. (DONE!) 8) get MATCH to work on fromnext/fromend/tillnew ( DONE! ) 4) write ALTER ( DONE!!) 12) make and (vars) optional, rather than having matches wierdly succeed if they aren't present. ( DONE! ) 3) write LEARN and CLASSIFY (DONE!) 9) get argv (or the command line) and the environmment into the VHT table. ( DONE !!!) 15) if WINDOW is first statement, don't suck in till EOF, let window do the work. ( DONE!!!) 16) let WINDOW check for match by character, newline, or eof (flags (with flags "bychar", "byline", and "byeof"). (DONE except for some skankyness when doing bychar on a TTY) 5) allow MATCH on variables or other files. (: means var, otherwise it's a file? Or should we have an explicit flag for other-than-input?) ( DONE!!! ) 13) write ISOLATE (DONE !!!) 7) get EXEC to run other executables (see netcat for help) (DONE!!!) 18) fix cdw/tdw bug in ALTER. ( DONE !!!) 19) fix EOF bug in SYSCALL ( DONE !!) 20) get rid of all those "unused variable" warnings. ( DONE!!! ) **** prealpha release here **** (DONE!!!) * Require that ';' be escaped unless it's an end-of-statement. ( DONE!!!) * Have preprocessor do "#insert"s. ( DONE!!! ) * Make Quick Reference Card .txt ( DONE!!!) * Write ReadMe ( DONE !!!) ***** alpha release here ****** 10) allow command-line program input {execute anything inside curly braces} ( DONE!!! ) 17) fix memory leak in INPUT... make it an ALTER. ( DONE!!! ) * Make a reasonable way to to keep an "SYSCALL" process around, so the exec process retains useful state. (DONE!!!) * Fix bug in output so that a /'ed filename isn't taken as the output pattern. ( DONE !!!!) * Fix bug in output to allow appending to a file with a " +" on the filename. ( DONE!!! ) * Write up "N useful idioms" in ReadMe ( DONE !!! ) ***** FRESHMEAT release here ***** * have previous match start and end points be part of the variable being matched against, not against the buffer (cdw or tdw) being matched against. ( DONE!!!) * maybe make a builtin variable for the entire data window, maybe call it :_dw: ? Make :_dw: the default matched variable. ( DONE !!!) * bugfix: change :*: to evaluate _once_, not iteratively... and have it evaluate unset variables to their valu sans :'s. (DONE!!!) * merge cdw and tdw into one big string storage space. This would make the autosizing problem somewhat easier. (DEFERRED -maybe not - will really slow down string alterations) * bugfix - should give a better error msg than SEGFAULTing when SYSCALLing with zero or one (paren) argument. (DONE!!!) * bugfix - if you use a '[' ( including [[:foo:]] patterns) in the match pattern, the '[' gets sucked in as a sub=space match. The temporary workaround is to always specify a subspace match first, but that's a kludge. The right fix is to do a better microcompiler parser. (DONE!!!) * let INPUT specify a file, rather than the kludge of "cat myfile.txt", which is also a kludge. (DONE!!!) * Add "union" and "intersection" operators (partially as a workaround of the bug in the Gnu Regex library that causes patterns of the form ".*literal" to fail when line-spanning is turned on and the searched string is more than 20479 characters long (ask how I found that bug!) (DONE!!!) * Change :_data_window_: to be :_dw:, newline to :_nl:, etc... (DONE!!!) * Make a real Bison-type parser (optional). (SORTA DONE!) 8) come up with generalized flag finder for match, output, etc. (nonpositional parser?) Issue: what about giving an error on a nonrecognized flag? The current nonpositional finder just ignores them. ( DONE!!! ) 21) make INPUT have consistent behavior across both stdin and files (i.e. always read to EOF unless by-line is requested) (DONE!!!) 22) clean up SYSCALL and OUTPUT so that () is variables and [] is outputs. (low priority) (CANCELLED - think on this more) 24) have EXIT return an output value.(DONE!!!!) 25) extend the mail filter to pop open BASE64 stuff before doing a classify. ( DONE !!! ) 30) Go to a full Bayesian match rather than a raw score match. ( DONE !!! ) *) have the gz file unpack into a subdirectory "install", not in the current dir. (DONE!!!) 32) fix OUTPUT so the flag does what + does now. (DONE !!!) 33) add a steppable tracer/debugger. Or at least the ability to single-step if desired. (-t, -t nnn = run nnn statements, then go into single step mode. In singlestep, p /whatever/ pretty-prints whatever, a :var: /whatever/ alters var, n single-steps the next statement, c continues execution at full speed, d / D toggle user and system debugging, any decimal number runs that many more statements, h gives you help. ( DONE !!! ) 35) 8-bit-safe the data paths ( DONE EXCEPT FOR THE REGEX ENGINE ) (DONE!!!) 34) Decide to change #insert to just plain "insert" . (***DONE***) 39) add "kalman-belly-gaze" mode to autotrain on some fraction of the processed but unverified text. Definitely need to flag this forward to the user so the user can detrain. (***DONE***) 31) add ELSE and BBREAK (else flips sense of skip when encountered, BBREAK skips immediately to closing Bracket without doing any ELSEstuff). Maybe need to think about this some more? (****DONE by ALIUS statement ****) 45) add a fault catcher along the lines of ALIUS for catchable errors (***DONE**** with FAULT and TRAP) 44) add options to cssutil to rescale overflowing bins. (***DONE***) 47) add a third error class beyond FATAL - APOCALYPTIC errors, which are so bad that the trap system can't gaurantee a clean recovery. (***DONE***) 41) add a third class/result from the classifier: "Neither", where the number of hits in both categories is lower than would be statistically expected (i.e. it's an email spam that's being intentionally evasive). (*** SUBSUMED *** - see #48) 48) have a FAULT-NEITHER flag that compares the number of hits expected in the two categories to the actual number encountered. If the actual is way below the expected, throw a fault. (*** SUBSUMED *** - the :stat: variable now has the actual feature counts, and so the user can make their own decision) 49) have the preprocessor set the line number so that sometnhing more than "line 0" can be reported. (if possible). (***DONE***) 51) upgrade the .css files to have 32-bit ceilings and crosscut hashing to minimize hash-clash errors. (***DONE***) 37) Mailfilter: Add transforms to insert fake phrases based on heuristic (spamassassin like) features for things like lots of html, base64 armoring, so features can be seen in CLASSIFY. (*** DONE *** via rewrites.mfp) 50) fix TRAP so that it doesn't run the trap statement until it's sure that it can catch the trap. (in other words, don't run the statement evaluator. The reason for this is that nonfatals which would have been simply continued from the current statement now don't trap right.) (*** DONE ***) 47) have BASE64 expansion and spammus interruptus decommenting happen only if they're successful, otherwise they don't modify the matched text variable m_text. (***DONE***) 52) set up "extra stuff" so it's an attachment rather than just some text stuck on the end. Use cribbed formatting... (***DONE***) 53) let WINDOW work on any var, not just :_dw:. (***DONE***) 43) make CLASSIFY run against multiple categories, not just go/nogo. (separate categories with "|" and allow multiple .css files per category) (***DONE***) 57) HASH should take :_dw: by default (***DONE***) 11) decide if locations still need to be :whatever: (***YES*** that way, new statements can be added without problems) 46) get EVAL working properly (evaluates both sides till no more :*:, then does the assignment) (***DONE***) 54) let WINDOW take input from anywhere, not just stdin (i.e. from a file? from a variable?) (***DONE***) 58) WINDOW should take an input (:var:) as source and modify a [:var:]. See #54 (***DONE***) 23) let INPUT keep around a file that's already been partially read. (*** DONE *** - well, obviated, because you can now do random access reads and writes) 59) Put in "fseek" operators into INPUT and OUTPUT - but this needs a design - how do you delineate the fseeks? How do you have them compatible with substring operators in EVAL? (***DONE***) ********************************************************************* ************* Stuff not done yet below this line ******************** ********************************************************************* 14) write PERM (for now though, you can just exec cat >> them into a file, then suck 'em back in again.. so this isn't really all that much more powerful.) (Held in consideration...) 26) consider making the sizes of data windows and pattern buffers autosizing, so you can never "run out". 27) let INPUT use regexes to determine "end of input". 28) Add the proper hooks so that GDB et al can look at the executing process. 29) add time flags to syscall for timeouts; should have both absolute time and delta time since last successful output; unless is specified, looping continues until EOF occurs after waiting at least the timeout. Suggested values: to01, to1, to10, to100 (.1, 1, 10, 100 seconds total time), and dto01, dto1, dto10, dto100 (delta since last output of .1, 1, 10, 100 seconds.) Or come up with a way to cleanly parameterize this. 36) 8-bit-safe the program variable names 38) Mailfilter: send bounce messages when mailfilter rejects something??? 40) use 1/tokencount and 1-(1/tokencount) for probabilities. (or something from the Gary Robinson article) Witten-Bell = unseen words = P(wi|d) = ni+1 / N1 + seen_once LMMSE linear mean minimization estimator. (this is orthogonal to the switch to a Markov model) 42) make declensional parser aware of statement type, so it parses (and errors) on what it expects, rather than the general case of everything. (** half done - the flags do checking this way **) 55) create a setup "wizard" for creating mailfilterconfig.crm & .forward 56) create an "emacs mode" for CRM114 programs 60) Do "virtual variables" that are matched on-the-fly have any use? Or are they merely providing functionality that already exists? 61) Come up with useful semantics of CALL, ROUTINE, and RETURN. Should variables be mapped in, out, both? What about overlaps?