Sunday, March 22, 2009

Rulebase Benchmarks 2009


In concert with certain other "technical" persons in this field (Dr. Forgy, Gary Riley, et al) I would like to propose a new-and-improved benchmark for rulebased systems. Waltz and WaltzDB are still valid benchmarks but, unfortunately, most folks (read: programmers) have trouble visualizing the lines of logic that would help move lines from a 2D drawing to a 3D drawing. This is elementary for a "normal" engineer because they have had to endure Mechanical Engineering 101, aka Drafting Class.

I was "privileged" to see some recent emails from a "Benchmark" consortium and the vendors (surprise of surprises) did not like the Waltz nor the WaltzDB benchmarks because they were not "real world." First, the vendors have not defined what is a "real world" and second, since when was a benchmark "real world"?? The "real world" of most vendors these days is composed of financial problems that, while they are sometimes quite large, are never complex. (Think Abstract Algebra or Partial Differential Equation level of "complexity.")

WHY would we need a rulebase that can handle both massive numbers of rules and objects as well as very complex rules? Think about Homeland Security where there are thousands of Ports of Entry (POE) as well as perhaps a million travelers every day and millions of cargo boxes being shipped into the USA. The database query alone could take days when you have only a few minutes to determine (1) does a threat exist and (2) what is the level of that threat? An automobile of a specific nature parked too long in one place (think OKC) could be a "clue" on the threat level. A new object of a certain type on a roadway (think IED) is a potential threat in some areas but not in others.

One person from Saudi might not be (most likely is not) a threat. But knowing that he/she is related to another person of the same village entering at another port combined with another person of the same village either already here or entering from another port might quite possibly raise the threat level. (Meaning, what are the odds of three persons from the same, small village in Saudi Arabia entering the USA over a one, two or three day period?) These decisions sometimes have to be done within seconds while a person is standing at the counter - not hours later when that same person has been passed through and disappears into the local population.

Think of what happens in health underwriting when the things that must be considered are many and related. For example, a bad back (or knee or foot or whatever) could lead to declining health and possible heart attack depending on the severity of the injury. A heart attack could lead to even more declining health and death. Family history can and does play a huge part in underwriting. For example, being overweight means a potential increased risk in diabetes. If most everyone in the family has had diabetes (of either type) then the risks escalate. Having a family history of heart problems as well makes the problem even riskier. This is a large and complex problem that needs fast resolution. Assuming, of course, that the data are available in the first place. The reasoning process here is (can be) extremely complex and most times the human underwriter is the only person who can make that kind of determination when a rulebase would be a much better approach.

Fraud detection is a complex issue that is normally addressed from a superficial viewpoint rather than something "in depth" that might be reasonably accurate. Some of the issues of fraud detection (or homeland security or underwriting) could be handled with Rule-Based Forecasting (RBF) system as well as possibly linking the rulebase with Neural Networks to help predict what will happen. It has been shown (back in 1989) that neural net was much better at forecasting a time-dependent series than even the far more popular Box-Jenkins method of analysis and forecasting. It isn't much good at non-temporal situations but that's another story.

Let us return to our primary discussion of what should compose a rulebase benchmark. A rulebase benchmark should be composed of several tests:

Forward Chaining
Backward Chaining
Complex Rules
Rules with a high level of Specificity
Lots of (maybe 100 or more) "simple" rules that chain between themselves
Stress the conflict resolution strategy
Stress pattern matching

Just having an overwhelming amount of data is not sufficient for a rulebase benchmark - that would be more in line with a test of the database efficiency and/or the available memory. Further, it has been "proven" over time that compiling rules into Java code or into C++ code (something that vendors call "sequential rules") is much faster than using the inference engine. True, and it should be. After all, most inference engines are based in Java or C++ code and the rules are merely an extension, another layer of abstraction, if you will. But sequential rules do not have the flexibility of the engine and, in most cases, have to be "manually" arranged so that they fire in the correct order. An inference engine, being non-monotonic, does not have that restriction.

Simply put, most rulebased systems cannot pass muster on the simple WaltzDB-16 benchmark. We now have a WaltzDB-200 test should they want to try something more massive.

New Benchmarks: Perhaps we should try some of the NP-hard problems - that would eliminate most of the "also ran" tools. Also, perhaps we should be checking on the "flexibility" of a rulebase by processing on multiple platforms (not just Windows) as well as checking performance and scalability on multiple processors; perhaps 4, 8 or 16 (or more) CPU machines. An 8/16 CPU Mac is now available at a reasonable price as is the i7 Intel (basically 4/8 cores) CPU. But these are 64-bit CPUs and some rule engines are not supported for 64-bit platforms. Sad, but true. Some won't even run on Unix but only on LInux. Again, sad, but true.

So, any ideas? I'm thinking that someone, somewhere has a better suggestion than 64-Queens or Sudoku. Hopefully...


1 comment:

woolfel said...

Here is an example rule, which is based on a real health insurance rule.

- the person already tried claritin with no improvement
- the person already tried zyrtec with no improvement
- the age is over 10
- the person has PPO plan
- the doctor has approved allegra within the last 5 years
- pre-approve the customer for allegra

- the person already tried claritin with no improvement
- the person has not tried zyrtec
- recommend the customer try zyrtec and consult with their PCP

In terms the number of rules, it's common to have 10K+ within a single ruleset. As far as I know, the top 3 health insurance companies all have this type of use case. All of them use some kind of rule engine to automate pre-approval. Some use commercial products like blaze and jrules, while others write their own custom rule engine.