Saturday, March 21, 2009

Benchmarks 2009


In concert with certain other "technical" persons in this field (Dr. Forgy, Gary Riley, et al.) I would like to propose a new-and-improved benchmark for rule-based systems. Waltz and WaltzDB are still valid benchmarks but, unfortunately, most folks (read: programmers) have trouble visualizing the lines that would help move from a 2D drawing to a 3D drawing. This is elementary for a real engineer because they have had to endure Mechanical Engineering 101, aka Drafting Class.

I was "privileged" to see some recent emails from a "Benchmark" consortium, and the vendors (surprise of surprises) liked neither the Waltz nor the WaltzDB benchmark because they were not "real world." First, they have not defined what "real world" means and, second, since when was a benchmark "real world"? The "real world" of most vendors these days is composed of financial problems that, while sometimes quite large, are never complex. (Think Abstract Algebra or Partial Differential Equation level of "complex.")

WHY would we need a rulebase that can handle problems that are both massive and complex? Think about Homeland Security, where there are thousands of Ports of Entry (POE) as well as perhaps a million travelers every day and millions of cargo boxes being shipped into the USA. The database query alone could take days when you have only a few minutes to determine (1) whether a threat exists and (2) what the level of that threat is. An automobile of a specific nature parked too long in one place could be a "clue" on the threat level. A new object of a certain type on a roadway is a potential threat in some areas but not in others.

One person from Saudi Arabia might not be (most likely is not) a threat. But knowing that he/she is related to another person of the same village entering at another port, combined with another person of the same village either already here or entering from another port, definitely raises the threat level. (Meaning, what are the odds of three persons from the same village entering the USA over a one-, two- or three-day period?) These decisions sometimes have to be made within seconds while a person is standing at the counter - not hours later when that same person has been passed through and has disappeared into the local population.

Think of what happens in health underwriting when the things that must be considered are many and related. For example, a bad back (or knee or foot or whatever) could lead to declining health and a possible heart attack, depending on the severity of the injury. A heart attack could lead to even more declining health and death. Family history can and does play a huge part in underwriting. For example, being overweight means a potential increased risk of diabetes. If everyone in the family has had diabetes (of either type) then the risks escalate. Having a family history of heart problems as well makes the problem even riskier. This is a large and complex problem that needs fast resolution. Assuming, of course, that the data are available in the first place. The reasoning process here is (or can be) extremely complex, and most times the human underwriter is the only one who can make that kind of determination when a rulebase would be a much better approach.

Fraud detection is a complex issue that is normally addressed from a superficial viewpoint rather than something "in depth" that might be reasonably accurate. Some of the issues of fraud detection (or homeland security or underwriting) could be handled with a Rule-Based Forecasting (RBF) system, possibly linking the rulebase with Neural Networks to help predict what will happen. It has been shown (back in 1989) that a neural net was much better at forecasting a time-dependent series than even the far more popular Box-Jenkins method of analysis and forecasting.

But, I digress. Let us return to our primary discussion of what should compose a rulebase benchmark. A rulebase benchmark should be composed of several tests:

Forward Chaining
Backward Chaining
Complex Rules
Rules with a high level of Specificity
Lots of (maybe 100 or more) "simple" rules that chain between themselves
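
To make the last item concrete, here is a minimal sketch of what "100 simple rules that chain between themselves" could look like. Everything in it is an assumption for illustration: the `Rule` shape, the fact names `f0`..`f100`, and the naive matching loop, which is not how a Rete-based engine works internally. It is not part of any existing benchmark.

```python
from collections import namedtuple

# Hypothetical rule shape: a rule fires when all its premises are in working memory.
Rule = namedtuple("Rule", ["name", "premises", "conclusion"])

def make_chain(n):
    """Build n simple rules where the conclusion of rule i is the premise of rule i+1."""
    return [Rule(f"r{i}", frozenset({f"f{i}"}), f"f{i + 1}") for i in range(n)]

def forward_chain(rules, facts):
    """Naive forward chaining: keep sweeping the rules until no rule adds a new fact."""
    memory = set(facts)
    firings = 0
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.premises <= memory and rule.conclusion not in memory:
                memory.add(rule.conclusion)
                firings += 1
                changed = True
    return memory, firings

# Seed working memory with one fact and let the chain run to completion.
memory, firings = forward_chain(make_chain(100), {"f0"})
print(firings, "f100" in memory)
```

A real test along these lines would time the run and vary the chain length; the sketch only shows the shape of the rule set.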

Just having an overwhelming amount of data is not sufficient for a rulebase benchmark - that would be more in line with a test of the database efficiency and/or the available memory. Further, it has been "proven" over time that compiling rules into Java code or into C++ code (something that vendors call "sequential rules") is much faster than using the inference engine. True, and it should be. After all, most inference engines are written in Java or C++ and the rules are merely an extension. But sequential rules do not have the flexibility of the engine and, in most cases, have to be "manually" arranged so that they fire in the correct order. An inference engine, being non-monotonic, does not have that restriction.
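
The ordering point can be illustrated with a small sketch. It assumes a simplified model in which "sequential rules" means one pass over the rules in listed order, and the "engine" is a naive loop that re-evaluates rules until nothing new is derived. The function names and the three toy rules are hypothetical, and a production engine would use Rete matching and a conflict-resolution strategy rather than this loop.

```python
def sequential_pass(rules, facts):
    """Apply each rule exactly once, in the order listed (assumed model of sequential rules)."""
    memory = set(facts)
    for premises, conclusion in rules:
        if premises <= memory:
            memory.add(conclusion)
    return memory

def fixpoint(rules, facts):
    """Re-evaluate all rules until no new fact appears (naive stand-in for an inference engine)."""
    memory = set(facts)
    while True:
        new = {c for p, c in rules if p <= memory} - memory
        if not new:
            return memory
        memory |= new

# Three chained rules: a -> b, b -> c, c -> d
rules = [(frozenset({"a"}), "b"), (frozenset({"b"}), "c"), (frozenset({"c"}), "d")]
shuffled = list(reversed(rules))  # same rules, arranged in the "wrong" order

in_order = sequential_pass(rules, {"a"})         # reaches d
out_of_order = sequential_pass(shuffled, {"a"})  # stops after deriving b
engine = fixpoint(shuffled, {"a"})               # reaches d regardless of order

print(sorted(in_order), sorted(out_of_order), sorted(engine))
```

The single pass only gets the full answer when someone has arranged the rules by hand; the fixpoint loop pays for its order independence with repeated evaluation, which is the speed trade-off described above.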

Simply put, most rule-based systems cannot pass muster on the simple WaltzDB-16 benchmark. We now have a WaltzDB-200 test should they want to try something more massive.

New Benchmarks: Perhaps we should try some of the NP-hard problems - that would eliminate most of the "also ran" tools. Also, perhaps we should be checking on the "flexibility" of a rulebase by running it on multiple platforms (not just Windows) as well as checking performance and scalability on multiple processors; perhaps 4, 8 or 16 (or more) CPU machines. An 8/16 CPU Mac is now available at a reasonable price, as is the i7 Intel (basically 4/8 cores) CPU. But these are 64-bit CPUs and some rule engines are not supported on 64-bit platforms. Sad, but true. Some won't even run on Unix but only on Linux or only on Windows. Again, sad, but true.

Anyway, the next blog on ORF 2009 will be about Ms. Manners - the new version where we don't tell you HOW to solve the problem (we don't give you the rules) but you have to get the right answer.  Probably, in order to get the "right" answer we will have to provide the data.  Unless, of course, some kind vendor would find a college intern to do that.  :-)  

