Tuesday, February 4, 2014

Benchmarks: Who, What, Where When and Why?


A BRMS (Business Rule Management System) is, after all, a rulebased system that has evolved into what we now refer to as a Decision Manager or Business Decision Manager when applied to business systems.  Over the years we have tried to establish a set of benchmarks that will allow the users to test various systems for speed and efficiency on different, complex problems.  The two most famous tests are the Miss Manners test and the Waltz benchmark.  The original five OPS (Official Production System) benchmarks can be found at ftp://ftp.cs.utexas.edu/pub/ops5-benchmark-suite/ and will run on most any platform.  All major BRMS vendors have written the code for their particular language syntax for Miss Manners, Waltz and/or WaltzDB.  In addition, Dr. Forgy and I have written the NP Hard benchmarks for several systems.

The Miss Manners OPS benchmark originated about 15 years ago with OPS5 and CLIPS languages.  It’s a relatively simple rulebase with only eight rules that will do a depth-first search to find a solution and the program comes with a data generator.  The idea is that Miss Manners has invited 16, 32, 64 or 128 guests with various hobbies to a dinner party.  She want to seat the guests in boy-girl-boy-girl arrangement so that each guest will have someone on the left or right that has a common hobby.  Back in 1979 the Manners 128 program took 5,838.5 cpu seconds to run on a SPARC 1+.  Today, due to massive improvements in hardware, this one runs in about 1.5 seconds.  The data for Miss Manners should have X number of guests, 2 or 3 hobbies from a possibility of 5 hobbies and the guests must be equal number of male and female to be seated M-F-M-F etc.

It becomes more complex with the number of hobbies and number of guests.  The original test was written to really “stress” any rulebase but some vendors found the trick of putting a single “not” statement in one of the rules that would make it run 15 or 20 times faster.  Without that “trick” it’s a very good measure of how fast a system will run on any give platform and CPU.  Another “trick” is to re-arrange the data so that the rules will run faster because the benchmark is data-sensitive.

The Waltz OPS benchmark is another oldie but a goodie that will really stress a rulebase system because it checks to see how well the rulebase does pattern matching.  Consisting of 32 rules, it will analyze the lines of a two-dimensional drawing and label them as if they were edges in a three-dimensional object.  The Waltz benchmark also comes with a data-generator and is much harder to cheat with than the Miss Manners benchmark.  Waltz comes with a C program for generating data for any number of regions; 12, 25, 37, 50 or even 200.  UT maintains the object C files for convenience at /ops5c/lib/libops5c.a and a math library that is used with the benchmarks.  The SPARC 1+ time for Waltz-50 was 3,831.8 cpu seconds.  Today’s benchmarks are between 0.2 to 1.9 seconds at worst.

The WaltzDB OPS benchmark that, like Waltz, labels the lines in a 2D drawing in order to assign configure a 3D object.  The change is that WaltzDB can handle drawings with junctions of four or six lines while Waltz does junctions of only two or three.  WaltzDB only has 35 rules but its data sets have many, many more junctions.  WaltzDB also has its own data generator, waltzydb.c, and it also needs to be compiled.  I have run tests using 4, 8, 12, 16 and even 200 regions.  The WaltzDB 200 is the most difficult of all.  16 regions on the old SPARC 1+ took 8,033.3 cpu seconds but today it takes about 0.5 seconds.  The WaltzDB 200 takes only about 10 – 15 seconds on most systems but can take 2 seconds or less when running Rete-NT, the latest incarnation of the Rete Algorithm.

The A.R.P. OPS benchmark is program is an Aeronautical Route Planner that will plot a course across a given territory from P1 to P2 for a airplane or CRUISE missile.  There is a dataset generator that asks about 40 questions and generates a file called rav-sceneXxYxZ.dat where X, Y and Z are the 3D coordinates from the input data.  There is a sample list of questions in the README file at UT.  This benchmark is unique in that there are two files that have to be loaded, “filename”.dat and “arp-rp-makes”.  The best time for the A.R.P. benchmark on a SPARC 1+ with 10x20x30 data set was 1,220.2 cpu seconds.

The Weaver OPS benchmark is a combination of several expert systems that communicate through a common blackboard, or maybe a whiteboard today.  The “practical” application for this system is to design a VLSI (Very Large Scale Integrated) chip design, something that a chip manufacturer such as AMD or Intel face every day.  Far more detail is provided in the README file.  The best time for the Weaver benchmark on a SPARC 1+ was 1,053.7 cpu seconds.

The next two benchmarks come from a series of benchmarks known as “NP complete” benchmarks where NP stands for Non-deterministic Polynomial-time.  We have started using these this year, (1Q2014) since we have found that Manners and/or Waltz to be either 1) easy to cheat or 2) that the benchmark fires only one or two rules over and over.  Manners is guilty on both accounts.  So, this year we have include both the Clique Problem and the Vertex Cover Problem for starters.  Later we can expand this to other NP Complete problems. 

Either of these problems can be converted to Java or C syntax but, for starters, I plan on implementing these in Drools, Jess, CLIPS, Smarts, ODM and Blaze Advisor.  That should be enough for comparisons for this year.  Dr. Forgy has been kind enough to have already provided the code for these two NP-Complete problems in OPS syntax that we should be able to convert to Java, C/C++ or C#.   Or BASIC for that matter.

That pretty much covers the 5-W's of good journalistic articles.  The HOW (5W+H) part is not the most difficult.  Our plans are to provide the NP-Complete benchmarks, along with the first three UT benchmarks, for our talk at Decision Camp 2014 to be held in San Jose this November.  I hope to see many of you there since registration is, again, free thanks to Sparkling Logic and eBay.



Anonymous said...

Hard to guess why you are Yakov for this post and James for others.

James Owen said...

To Anonymous: My name in Hebrew is Yaakov which is Jacob in German or James in English. If you refer to a Hebrew New Testament you will find the book of James is the book of Yaakov. :-) So, sometimes, James, sometimes Yaakov. My other blog is always signed Yaakov.

Jacob Feldman said...


I believe that after so many years the benchmarks should be modified to stay closer to the decision management reality of these days. Most people today are less interested in rule engines performance but rather in an ability to represent and manage complex rules with exceptions and conflicts. For example, the basic Miss Manner could be modified to consider the situations when data sets do not provide enough people to satisfy gender and/or hobby requirements. Would a rule engine fail or would it offer a solution that minimizes rules violations? I did a quite detailed analysis of this problem in my RulesFest-2009 presentation. BTW, the RulesFest that time was nicely chaired by James Owen. Good luck,

Jacob Feldman, jacobfeldman@openrules.com

James Owen said...

@Jacob: I understand the resistance to older benchmarks, such as Miss Manners. However, this year we are considering only three of the original benchmarks: Waltz50, WaltzDB-16 and WaltzDB-200. In addition we are evaluating Cliques and Vertex Cover, both from the NP-Complete problems found at http://en.wikipedia.org/wiki/List_of_NP-complete_problems .

What is difficult is to get any of the benchmarks done for BRMS other than the OPS-type, such as Drools, OPSJ, JRules/ODM, Sparkling Logic, Jess, CLIPS, Blaze Advisor, etc. Visual Rules, VisiRules and Open Rules and others of this type seem to have problems implementing these benchmarks properly. Also, at this point, none of the vendors care enough to help with implementing any of the benchmarks with their particular machine.

I am encouraging everyone to be at Decision Camp 2014 this year so that we can have a meaningful discussion on javascript:void(0)benchmarks.


Gary Riley said...

Here's something to think about after the problems you had validating the results of the waltz db benchmark: http://rileyonlife.blogspot.com/2014/05/river-benchmark.html. You can use the random conflict resolution strategy in CLIPS to see if there are order dependencies between rules of the same priority.

zachary2232 said...

Will you be posting any of the benchmark results of the different rules systems from the decision camp?

James Owen said...

@Gary: The whole purpose of using different conflict resolution strategies is that one works best with one set of problems and another works best with another set of problems, or rules. When MEA works one way, Depth or Breadth might work another and, naturally, each one would fire (normally) a different number of rules depending on the order in which they were evaluated. Personally, I really do not like priorities, or salience, since that seems to be a "forcing function" on the rules.

The main purpose of Declarative Programming is that the rules should have HAVE to have a certain order nor salience. However, initialization rule(s) should be high and output rules should be low or minus. That said, a rulebase should not have more than three levels of salience. So saith all of the "older" theories of programming a rulebase.

James Owen said...

@zachary2232: I will send you whatever you want via emails but I probably will NOT publish them on my blog. The reason is that most are not complete and sometimes cannot be duplicated on other machines EXACTLY so I get arguments. I use my machines and publish the results that depend on number of CPUs, what kind of CPUx, amount of RAM, what kind of HD and, usually, the exact command line used for each machine and setup to minimize the time spent. Each run can be different so I usually run each one about 10 times to get an average.

James Owen said...

I said:
The main purpose of Declarative Programming is that the rules should have HAVE to have a certain order nor salience.
I SHOULD have said:
The main purpose of Declarative Programming is that the rules should NOT have to have a certain order nor salience.