Thursday, July 24, 2008

Leveling the Playing Field for Rulebase Benchmarks

Greetings, Programs:

I have, in the past, published lots of benchmark results. Most were "fair" in my own eyes, but recently Gary Riley called my attention to the fact that they were not, in actuality, fair. For instance, some were run in interpreted mode while others were run in compiled mode. In days of yore I didn't have much of a choice, since some of the rulebase vendors offered only compiled mode or only interpreted mode. Some would not allow external objects and others required external objects.

So, I have decided that we will start a whole new era of rulebase benchmarks. They will be broken down into several categories:

> Interpreted Rules

> Compiled Rules (usually some kind of OPS rulebase) for vendors that have such configurations; this will include both inference mode and sequential mode

> Decision Table benchmarks for Corticon, Visual Rules, and other companies that use such tools.

Some companies, such as ILOG (now IBM) and Fair Isaac, will be able to do all three. Others, such as Corticon and Visual Rules, will be able to do only the decision table version. OPSJ will be compiled only. I have determined that ALL three methods of using rules will HAVE to use external objects, not internal objects. The reason is that "normal" business practice uses external objects behind an API, which adds a layer of abstraction: the objects can change, have things added to them, whatever, and it should not affect in any way what happens with the old rules. New rules can then be written to access the old objects or the modified objects.

So, considering the first two categories, these would be conventional, OPS-type, IF-THEN-ELSE rules. What we don't want to do is resurrect the Miss Manners benchmark, and probably not the Waltz benchmarks either, since both classes of benchmarks seem to rely on one rule firing over and over and over. Way too easy to cheat - and several closed-source companies did so long ago.

But what if we came up with an NP-Complete problem that should be solved within a certain timeframe, to judge whether a rule engine can or cannot be used on a viable commercial business problem AS WELL AS within the scientific and academic communities? I don't think that decision tables can do this kind of problem, so we'll have to find a completely different benchmark for them, one that deals strictly with getting and reading the data, putting the data in the spreadsheet (OK, decision table) and then firing up the rule engine. Something like that. I'm open to suggestions, so let me know if you can think of anything.

Meanwhile, congratulations again to Drools for best 2008 Open Source Rulebase/BRMS and to Fair Isaac Blaze Advisor for the best 2008 BRMS application. Well-deserved and well-earned.



woolfel said...

There are a couple of common business use cases that I've seen first hand. I included these in my list of "potential" benchmarks a long time ago.

Here are 3 different kinds of "realistic" business scenarios. These aren't detailed enough to write the rules from; each is a general description of the type of rules involved.

Message filtering -
thousands of simple read only rules
objects with lots of fields
rules reason over 1 object without joins
the application is responsible for transforming some other object model and converting it to the flat format

Simple Calculation -
several hundred rules that modify an object with a calculated value
the objects tend to be normalized
rules reason over multiple objects and perform simple joins
each object may participate in multiple rule activations

Inference rules -
several hundred to thousands of rules that modify objects, and create new objects
the object model tends to be complex with dozens of entities
rules use chaining to control the business logic and may include simple calculations
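To make the first scenario (message filtering) concrete, here is a toy Python sketch. The field names, thresholds, and rule logic are invented purely for illustration; the point is the shape of the workload: thousands of simple, read-only predicates, each reasoning over a single flat object with no joins.

```python
# Flat message objects: rules reason over one object at a time, no joins.
# The application would be responsible for flattening a richer object
# model into this format, as the scenario describes.
def make_message(i):
    return {"id": i, "amount": (i * 37) % 1000, "priority": 1 + i % 5}

# Thousands of simple read-only rules, each a predicate over one message.
# (The threshold/priority scheme is made up for this sketch.)
rules = [
    (lambda msg, t=t: msg["amount"] > t and msg["priority"] >= 3)
    for t in range(2000)
]

messages = [make_message(i) for i in range(100)]

# A message is flagged if any rule matches it; no rule modifies anything.
flagged = [m for m in messages if any(rule(m) for rule in rules)]
print(f"{len(flagged)} of {len(messages)} messages flagged")
```

A real engine would compile these predicates into a discrimination network rather than testing them linearly, which is exactly what this category of benchmark would measure.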

Johan Lindberg said...


I'm a bit curious about why you think an NP-complete problem would serve well as a benchmark. I can think of several cons but don't really see any pros...

I like Peter's suggestion about using a common business use case. What about a decision-tree-based object classifier? It can be made in various degrees of complexity, and you can vary the data size to make it tougher if necessary.

James Owen said...

NP-Complete benchmarks (well, there aren't any yet, so someone will have to write them) are almost "cheat free" - something very important to guys who benchmark almost every rulebase engine out there. I have found ways to cheat Miss Manners and Waltz, and so have several vendors. The non-open-source vendors cheat "under the covers" and you can't prove it, but you KNOW they did when one benchmark improves by 1000:1 over past results while other benchmarks stay the same.

However, what I'm proposing is along the lines of Peter's suggestions, and I've been working on two of them for a couple of years now. They will allow anyone to use anything in the BRMS world to solve the problem: interpreted rules, compiled rules, decision tables, decision trees - don't care, just so long as the answer is correct and the outputs are produced. This is not a "true" benchmark, since it does not require the engine to do a certain amount of work; it is more like a Derby.

The NPC benchmark will be engine intensive; it will require lots and lots of processor power. I haven't started it yet, so I'm open to suggestions. Wikipedia lists about 20 of them, but I'm sure that there are more of them somewhere in the world.

BTW, what were the "cons" of using an NP-Complete benchmark? I guess I don't really see any, except that someone will have to be an independent "judge and jury" on the problem.


Johan Lindberg said...

Hi again,

BTW, what were the "cons" of using NP-Complete benchmark?

Well. Here are the ones I thought of when I made my comment.

1) If the solver is going to be at least somewhat efficient, you're going to have to produce quite involved code, so I think it's going to be difficult to write, to get right, and to read (for the rest of us).

2) I think that the solution *could* look quite different on different engines, which might cause proponents of an engine to argue that you should write it this or that way using this or that specific feature, making it more difficult to make fair comparisons and/or get accepted as a fair benchmark.

3) I don't think you'll be able to vary the data size much, which might otherwise be a simple way to make it harder to solve (since NP-complete problems tend to go from *real* hard to *impossible* with a very small increase in the data set).

4) You're going to have to choose an algorithm and a heuristic for guiding the program through the solution space, and unless that algorithm and heuristic are well known, thoroughly studied and understood, you might well end up with cheaters anyway (just in a different way than with Ms Manners).

All of the above is, of course, mostly speculation on my part.

I've attempted to solve the Sling Blade Runner problem with CLIPS but gave up before I managed to get a "good" solver. Sling Blade Runner is an instance of "find the longest (non-cyclic) path through an unrooted graph", which is NP-complete.

The code got very complex quite quickly, and my experience trying to get a CLIPS solver working is a large part of the basis for my comment.

But there might be better (or worse) instances of NP-complete problems where most of these "cons" are minimized or where they even go away.
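For reference, the longest-simple-path problem Johan mentions can be sketched with a brute-force search in a few lines of Python. This is a hypothetical illustration, not his CLIPS solver: it exhaustively tries every simple path, which is fine on a toy graph and combinatorially hopeless on anything like the Sling Blade Runner data set, which is the whole difficulty.

```python
# Brute-force longest simple (non-cyclic) path in an undirected graph.
# Exhaustive DFS from every start node; exponential in the worst case.
def longest_path(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    best = []

    def dfs(node, path):
        nonlocal best
        if len(path) > len(best):
            best = path[:]
        for nxt in graph[node]:
            if nxt not in path:          # keep the path simple (no cycles)
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    for start in graph:
        dfs(start, [start])
    return best

# Toy graph: a square with one diagonal; the longest path visits all 4 nodes.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")]
print(longest_path(edges))
```

Encoding the same search as rules and working memory is where the complexity Johan describes comes from: the path state that a recursive function carries implicitly has to be represented as facts.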

BTW. There's a longer list of NP-complete problems here.

Gary Riley said...

My sudoku benchmark is here. Key benefits are:

It's designed to fire the same number of rules regardless of the conflict resolution strategy for any engine that supports salience/priority values.

Solutions are unique and easily verified.

There are test cases for each of the 18 supported solution techniques. The design also allows the benchmark to be incrementally developed and tested as the more complex solution techniques are not employed until the easier ones fail.

Although it's a bit more complex than manners, the problem domain is easily understood and there's a lot of information online describing the various solution techniques used.

Like manners and waltz, the amount of data can be scaled by selecting the size of the puzzle (3x3 x 3x3, 4x4 x 4x4, ...).

It can be run in non-stressed and stressed modes to demonstrate how implementation techniques can affect performance.

So far I've got it working with CLIPS, Jess, and JRules.

I also have a modified version for CLIPS that demonstrates how performance is affected by using modules rather than the "phase" facts utilized by manners and waltz.
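The "easily verified" property Gary mentions is worth spelling out: unlike manners or waltz, a sudoku solution can be checked mechanically, independent of the engine that produced it. Here is a minimal, hypothetical n×n checker in Python (not part of Gary's benchmark code):

```python
import math

# Verify an n x n sudoku solution: every row, column, and box must
# contain exactly the values 1..n. n must be a perfect square
# (4, 9, 16, ... matching the 2x2, 3x3, 4x4 box sizes Gary describes).
def is_valid_solution(grid):
    n = len(grid)
    box = math.isqrt(n)
    assert box * box == n, "grid size must be a perfect square"
    want = set(range(1, n + 1))
    rows_ok = all(set(row) == want for row in grid)
    cols_ok = all({grid[r][c] for r in range(n)} == want for c in range(n))
    boxes_ok = all(
        {grid[br + r][bc + c] for r in range(box) for c in range(box)} == want
        for br in range(0, n, box)
        for bc in range(0, n, box)
    )
    return rows_ok and cols_ok and boxes_ok

# A solved 4x4 puzzle (2x2 boxes).
solved = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 4, 3],
    [4, 3, 2, 1],
]
print(is_valid_solution(solved))
```

Because verification is this cheap, a benchmark run can always be audited, which directly addresses the "cheating under the covers" complaint earlier in the thread.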

Daniel Selman said...


I'm going to side with Peter on this one - I think we need a benchmark aligned with typical customer use cases rather than something like Sudoku.

For example, at ILOG we wouldn't use JRules to solve a problem like Sudoku, but rather something much better adapted to NP-hard/complete search problems, like ILOG CPLEX or CP.

These products have much more sophisticated search strategies out of the box than a generic inference engine like JRules.

I blogged a few thoughts on this at:

We should be helping people pick the right tool for the job - not all problems are best solved using an inference engine such as JRules.


woolfel said...

Although I like NP benchmark problems, for the typical developer who has to build something quickly, they don't help. The real benefit of having "realistic" business scenarios is that they give developers an idea of how to implement a solution and what the trade-offs are. So a business benchmark is useful as a learning tool and not just as a performance measurement.

Benchmarks that stress the engine are great for exposing inefficient code or design issues, but they're really hard for beginners to grasp.

Gary Riley said...

I agree that the BRMS folks need to develop some reasonable benchmark programs, but there are those of us who are still interested in benchmark programs for rete engines used for general purpose programming. Talking about how crappy manners and waltz are as benchmark programs, but then publishing performance numbers anyway, doesn't really help the situation.

The fact that the inappropriateness of these benchmarks has been discussed ad nauseam for years and yet we currently don't have any better benchmarks is an indication that we can't be particularly picky about which ones we use. It would be nice if vendors and companies made 'real world' applications available to the public, or if there were a cadre of programming wonks developing realistic example programs that could be utilized as benchmarks, but that hasn't happened.

Even if we did have some 'real world' applications, these wouldn't necessarily make the best benchmarks. Real world applications tend to take advantage of the features of the tool they are built with, which makes porting them to other tools difficult as well as verifying that they are behaving correctly.

Beggars can't be choosers. If we had a surplus of benchmarks we could take the best and drop the rest, but right now we're not in that situation. Having additional benchmarks, even if they are not perfect, would blunt some of the emphasis that has been placed on manners and waltz.

James Owen said...

I "had" promised quite some time ago to have two "real world" benchmarks ready "soon," one small and one large, where "soon" became defined as "when I get some time and feel so inspired." The small one is almost ready: lots of data, taken from a data scrubbing problem that I worked on some time back. With the client's permission, of course.

The small one isn't too complicated - but it was the benchmark that we used to 'weed out' vendors who could not demonstrate the ability to process a certain number of rows per second. The minimum acceptable rate was 250 (later scaled down to 220) rows of data per second, because the company expected 8M rows of data each night that had to be properly scrubbed and sent off for report generation by 8:00 each morning.
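As a quick sanity check on those numbers (a hypothetical back-of-the-envelope script, not part of the original engagement), the stated throughput floors translate into roughly a nine-to-ten-hour overnight batch window for 8M rows:

```python
rows = 8_000_000

# Convert the stated rows-per-second floors into wall-clock hours
# for the nightly 8M-row batch.
for rate in (250, 220):
    hours = rows / rate / 3600
    print(f"{rate} rows/s -> {hours:.1f} hours for {rows:,} rows")
```

That's why the rows-per-second floor mattered: an engine much below it simply could not finish the nightly run before the 8:00 a.m. reporting deadline.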

Personally, I don't like the idea of using a rulebase for this, BUT it was important, since the business guys wanted to be able to change the rules at any time if they found a flaw, and they didn't want to wait three months - they were losing $4M each month that they were out of compliance, and this allowed them to trim the charges down to "only" $100K each month.

Anyway, I'll certainly get that one out before the October Rules Fest, now being called ORF. Also, I would like to get an NP-complete example out and running, mostly because it will show that not everyone can even DO an NP-complete problem.

More on ORF next week. Right now, without a central air conditioner here (we still have just the two window units) it gets really warm in the office.

Thanks for the comments.

Johan Lindberg said...


Is the modified version for CLIPS - the one that demonstrates how performance is affected by using modules rather than the "phase" facts utilized by manners and waltz - available in the SVN repository as well?

It would be very interesting to see those results.

woolfel said...

I've been meaning to write a set of benchmarks, but I've been waiting for PR Ruleml. Too bad it's taking so long to get PR Ruleml out.

Over the next few weeks, I'll try to write up a few of the scenarios I'm thinking of and post them on the jamocha SVN. They will be in CLIPS format.

snshor said...

I agree with Daniel, that NP problems are better suited for CP solvers, both in performance and expressiveness terms.


Have you posted the actual results somewhere? Or does one have to run it to get the performance numbers? I could not find them.

Gary Riley said...

Here are the results of an earlier version of the sudoku benchmark: