A BRMS (Business Rule Management System) is, after all, a rule-based system that has evolved into what we now refer to as a Decision Manager or Business Decision Manager when applied to business systems. Over the years we have tried to establish a set of benchmarks that allow users to test various systems for speed and efficiency on different, complex problems. The two most famous tests are the Miss Manners test and the Waltz benchmark. The original five OPS (Official Production System) benchmarks can be found at ftp://ftp.cs.utexas.edu/pub/ops5-benchmark-suite/ and will run on almost any platform. All major BRMS vendors have written the code for Miss Manners, Waltz and/or WaltzDB in their particular language syntax. In addition, Dr. Forgy and I have written the NP-Complete benchmarks for several systems.
The Miss Manners OPS benchmark originated about 15 years ago with the OPS5 and CLIPS languages. It is a relatively simple rulebase with only eight rules that performs a depth-first search to find a solution, and the program comes with a data generator. The idea is that Miss Manners has invited 16, 32, 64 or 128 guests with various hobbies to a dinner party. She wants to seat the guests in a boy-girl-boy-girl arrangement so that each guest has someone on the left or right with a common hobby. On the original SPARC 1+ runs, the Manners 128 program took 5,838.5 cpu seconds; today, due to massive improvements in hardware, it runs in about 1.5 seconds. The data for Miss Manners should have X guests, each with 2 or 3 hobbies chosen from a possible 5, and there must be an equal number of male and female guests so they can be seated M-F-M-F, etc.
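To give a feel for what the rules actually have to match, here is a rough CLIPS-style sketch. It is not the benchmark code itself; the template and rule names are my own, and the real program uses several more fact types to drive the depth-first backtracking.

   (deftemplate guest
      (slot name)
      (slot sex)      ; m or f
      (slot hobby))   ; a guest with three hobbies shows up as three guest facts

   (deftemplate seating
      (slot seat)     ; seat number already filled
      (slot name))    ; who is sitting there

   (deffacts sample-guests
      (guest (name n1) (sex m) (hobby 2))
      (guest (name n1) (sex m) (hobby 3))
      (guest (name n2) (sex f) (hobby 2))
      (seating (seat 1) (name n1)))

   ; Simplified version of the core join: find an unseated guest of the
   ; opposite sex who shares a hobby with the guest in the last filled seat.
   (defrule find-next-guest
      (seating (seat ?s) (name ?n1))
      (guest (name ?n1) (sex ?sex1) (hobby ?h))
      (guest (name ?n2) (sex ~?sex1) (hobby ?h))
      (not (seating (name ?n2)))
      =>
      (assert (seating (seat (+ ?s 1)) (name ?n2))))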
It becomes more complex with the number of hobbies and the number of guests. The original test was written to really "stress" any rulebase, but some vendors found the trick of putting a single "not" statement in one of the rules that makes it run 15 or 20 times faster. Without that "trick" it is a very good measure of how fast a system will run on any given platform and CPU. Another "trick" is to re-arrange the data so that the rules run faster, because the benchmark is data-sensitive.
The Waltz OPS benchmark is another oldie but a goodie that will really stress a rulebase because it checks how well the engine does pattern matching. Consisting of 32 rules, it analyzes the lines of a two-dimensional drawing and labels them as if they were edges in a three-dimensional object. The Waltz benchmark also comes with a data generator and is much harder to cheat on than Miss Manners. Waltz comes with a C program for generating data for any number of regions: 12, 25, 37, 50 or even 200. For convenience, UT maintains the compiled C object files at /ops5c/lib/libops5c.a, along with a math library used with the benchmarks. The SPARC 1+ time for Waltz-50 was 3,831.8 cpu seconds; today's times are between 0.2 and 1.9 seconds at worst.
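To give a flavor of the kind of pattern matching involved, here is an illustrative sketch of my own, not the benchmark's actual rules; the real program goes on to assign the +, - and boundary labels and propagate the constraints between junctions.

   (deftemplate line (slot p1) (slot p2))
   (deftemplate edge (slot p1) (slot p2))
   (deftemplate junction (slot base) (slot e1) (slot e2) (slot type))

   (deffacts sample-drawing
      (line (p1 a) (p2 b))
      (line (p1 a) (p2 c)))

   ; Every input line becomes two directed edges.
   (defrule reverse-edges
      (line (p1 ?a) (p2 ?b))
      =>
      (assert (edge (p1 ?a) (p2 ?b)))
      (assert (edge (p1 ?b) (p2 ?a))))

   ; Exactly two edges leaving the same point form an "L" junction.
   (defrule make-L-junction
      (edge (p1 ?base) (p2 ?a))
      (edge (p1 ?base) (p2 ?b&~?a))
      (not (edge (p1 ?base) (p2 ?c&~?a&~?b)))
      (not (junction (base ?base)))
      =>
      (assert (junction (base ?base) (e1 ?a) (e2 ?b) (type L))))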
The WaltzDB OPS benchmark, like Waltz, labels the lines of a 2D drawing in order to configure a 3D object. The difference is that WaltzDB can handle drawings with junctions of four or six lines, while Waltz handles junctions of only two or three. WaltzDB has only 35 rules, but its data sets have many, many more junctions. WaltzDB also has its own data generator, waltzydb.c, which likewise needs to be compiled. I have run tests using 4, 8, 12, 16 and even 200 regions; WaltzDB-200 is the most difficult of all. Sixteen regions on the old SPARC 1+ took 8,033.3 cpu seconds, but today it takes about 0.5 seconds. WaltzDB-200 takes only about 10 to 15 seconds on most systems, but can take 2 seconds or less when running Rete-NT, the latest incarnation of the Rete Algorithm.
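The extra work comes from those bigger junctions. In CLIPS terms the difference looks roughly like this; again, this is my own illustrative sketch, not the benchmark's templates.

   ; A WaltzDB-style junction can involve more lines than Waltz's two or
   ; three, so a multislot is a natural fit and the legal-label tables are
   ; much larger.
   (deftemplate db-junction (slot base) (multislot edges) (slot type (default nil)))

   (deffacts sample-junction
      (db-junction (base p7) (edges p1 p2 p3 p4)))   ; type defaults to nil

   (defrule classify-four-line-junction
      ?j <- (db-junction (edges $?e) (type nil))
      (test (= (length$ ?e) 4))
      =>
      (modify ?j (type four-line)))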
The A.R.P. OPS benchmark is an Aeronautical Route Planner that plots a course across a given territory from P1 to P2 for an airplane or cruise missile. There is a dataset generator that asks about 40 questions and generates a file called rav-sceneXxYxZ.dat, where X, Y and Z are the 3D coordinates from the input data. There is a sample list of the questions in the README file at UT. This benchmark is unique in that two files have to be loaded, "filename".dat and "arp-rp-makes". The best time for the A.R.P. benchmark on a SPARC 1+ with the 10x20x30 data set was 1,220.2 cpu seconds.
The Weaver OPS benchmark is a combination of several expert systems that communicate through a common blackboard (or maybe a whiteboard today). The "practical" application for this system is VLSI (Very Large Scale Integration) chip design, something that chip manufacturers such as AMD or Intel face every day. Far more detail is provided in the README file. The best time for the Weaver benchmark on a SPARC 1+ was 1,053.7 cpu seconds.
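The blackboard idea maps naturally onto a production system, since working memory is the shared board. A toy sketch of my own (not Weaver's rules) of two "experts" cooperating through it might look like:

   (deftemplate task   (slot name) (slot status))
   (deftemplate result (slot task) (slot value))

   (deffacts blackboard
      (task (name place-cells) (status open))
      (task (name route-wires) (status waiting)))

   ; Expert #1: the placement "expert" posts its result to the board.
   (defrule placement-expert
      ?t <- (task (name place-cells) (status open))
      =>
      (modify ?t (status done))
      (assert (result (task place-cells) (value placed))))

   ; Expert #2: the routing "expert" only wakes up once placement is posted.
   (defrule routing-expert
      (result (task place-cells))
      ?t <- (task (name route-wires) (status waiting))
      =>
      (modify ?t (status done))
      (assert (result (task route-wires) (value routed))))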
The next two benchmarks come from a series known as "NP-Complete" benchmarks, where NP stands for Non-deterministic Polynomial-time. We started using these this year (1Q2014), since we have found Manners and/or Waltz to be either 1) easy to cheat on or 2) prone to firing only one or two rules over and over; Manners is guilty on both counts. So, this year we have included both the Clique Problem and the Vertex Cover Problem for starters. Later we can expand to other NP-Complete problems.
Either of these problems can be converted to Java or C syntax, but for starters I plan on implementing them in Drools, Jess, CLIPS, Smarts, ODM and Blaze Advisor. That should be enough for comparisons this year. Dr. Forgy has been kind enough to provide the code for these two NP-Complete problems in OPS syntax, which we should be able to convert to Java, C/C++ or C#. Or BASIC, for that matter.
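To make Vertex Cover concrete: given a graph and a candidate set of vertices, every edge must have at least one endpoint in the set. Here is a toy CLIPS check of my own, not Dr. Forgy's OPS code; Clique can be sketched in the same spirit.

   (deftemplate edge     (slot v1) (slot v2))
   (deftemplate in-cover (slot v))

   (deffacts triangle-graph
      (edge (v1 a) (v2 b))
      (edge (v1 b) (v2 c))
      (edge (v1 c) (v2 a))
      (in-cover (v a)))   ; candidate cover {a}; edge b-c is left uncovered

   ; Flag any edge with neither endpoint in the candidate cover.
   (defrule uncovered-edge
      (edge (v1 ?x) (v2 ?y))
      (not (in-cover (v ?x)))
      (not (in-cover (v ?y)))
      =>
      (printout t "Edge " ?x "-" ?y " is not covered" crlf)
      (assert (not-a-cover)))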
That pretty much covers the 5 W's of good journalistic articles. The HOW (the H in 5W+H) is not the most difficult part. Our plan is to present the NP-Complete benchmarks, along with the first three UT benchmarks, in our talk at Decision Camp 2014, to be held in San Jose this November. I hope to see many of you there, since registration is, again, free thanks to Sparkling Logic and eBay.
Shalom
Yaakov
9 comments:
Hard to guess why you are Yakov for this post and James for others.
To Anonymous: My name in Hebrew is Yaakov which is Jacob in German or James in English. If you refer to a Hebrew New Testament you will find the book of James is the book of Yaakov. :-) So, sometimes, James, sometimes Yaakov. My other blog is always signed Yaakov.
James,
I believe that after so many years the benchmarks should be modified to stay closer to the decision management reality of these days. Most people today are less interested in rule engine performance than in the ability to represent and manage complex rules with exceptions and conflicts. For example, the basic Miss Manners could be modified to consider situations when the data sets do not provide enough people to satisfy the gender and/or hobby requirements. Would a rule engine fail, or would it offer a solution that minimizes rule violations? I did quite a detailed analysis of this problem in my RulesFest-2009 presentation. BTW, the RulesFest that time was nicely chaired by James Owen. Good luck,
Jacob Feldman, jacobfeldman@openrules.com
@Jacob: I understand the resistance to older benchmarks, such as Miss Manners. However, this year we are considering only three of the original benchmarks: Waltz50, WaltzDB-16 and WaltzDB-200. In addition we are evaluating Cliques and Vertex Cover, both from the NP-Complete problems found at http://en.wikipedia.org/wiki/List_of_NP-complete_problems .
What is difficult is getting any of the benchmarks done for a BRMS other than the OPS type, such as Drools, OPSJ, JRules/ODM, Sparkling Logic, Jess, CLIPS, Blaze Advisor, etc. Visual Rules, VisiRules, Open Rules and others of that type seem to have problems implementing these benchmarks properly. Also, at this point, none of the vendors care enough to help with implementing any of the benchmarks on their particular engine.
I am encouraging everyone to be at Decision Camp 2014 this year so that we can have a meaningful discussion on benchmarks.
James
Here's something to think about after the problems you had validating the results of the waltz db benchmark: http://rileyonlife.blogspot.com/2014/05/river-benchmark.html. You can use the random conflict resolution strategy in CLIPS to see if there are order dependencies between rules of the same priority.
Will you be posting any of the benchmark results for the different rule systems from Decision Camp?
@Gary: The whole purpose of using different conflict resolution strategies is that one works best with one set of problems, or rules, and another works best with another. Where MEA works one way, Depth or Breadth might work another, and naturally each one would (normally) fire a different number of rules depending on the order in which they were evaluated. Personally, I really do not like priorities, or salience, since that seems to be a "forcing function" on the rules.
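For anyone who wants to try what Gary suggests, switching the strategy in CLIPS is a one-liner; if a run under the random strategy gives different results from a depth run, there is probably a hidden ordering dependency between same-salience rules:

   (set-strategy random)   ; depth, breadth, simplicity, complexity, lex, mea or random
   (reset)
   (run)
   (get-strategy)          ; confirm which strategy is in effect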
The main purpose of Declarative Programming is that the rules should NOT have to have a certain order nor salience. However, initialization rule(s) should be high and output rules should be low or negative. That said, a rulebase should not have more than three levels of salience. So saith all of the "older" theories of programming a rulebase.
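In CLIPS, that "no more than three levels" rule of thumb looks something like this (an illustrative sketch, not from any particular benchmark):

   (defrule initialize
      (declare (salience 100))     ; high: runs before anything else
      =>
      (assert (phase processing)))

   (defrule do-the-work            ; default salience of 0 for the real rules
      (phase processing)
      =>
      (assert (result 42)))

   (defrule report-results
      (declare (salience -100))    ; low/negative: output after the work is done
      (result ?r)
      =>
      (printout t "Result: " ?r crlf))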
@zachary2232: I will send you whatever you want via email, but I probably will NOT publish the results on my blog. The reason is that most are not complete, and sometimes they cannot be duplicated EXACTLY on other machines, so I get arguments. I use my own machines and publish results that depend on the number of CPUs, what kind of CPUs, the amount of RAM, what kind of HD and, usually, the exact command line used for each machine and setup to minimize the time spent. Each run can be different, so I usually run each one about 10 times to get an average.
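For anyone reproducing the numbers, one easy way to capture both rules fired and run time in CLIPS is the statistics watch item; other engines have their own counters:

   (watch statistics)   ; after each (run), CLIPS prints rules fired and run time
   (reset)
   (run)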