Saturday, November 3, 2007

Benchmarks - Final Bleat

OK, here's my final bleat on rulebase engine benchmarks. (Right - YOU believe me, don't you?) Anyway, benchmarks cannot, and should not, be done by the vendor but rather by some independent party; a firm that has no vested interest in any of the outcomes other than the truth and fairness of play. If a vendor does produce a benchmark, then the vendor has to allow anyone (and by that I mean someone outside the company, including the vendor's competitors) to double-check those figures, even if that means supplying a time-bombed version of the software to the company or person checking the facts.

If the benchmarks are produced by a company other than the vendor, then that company should be totally independent of the vendor, someone at "arm's length." This means that a "partner" of the vendor could not produce those benchmarks. However, a company that simply "uses" those products in the normal course of its business would be not only acceptable but preferred. (A consulting company that is NOT a partner of any of the rulebase vendors, for example.) On the other hand, a company that does NOT use a rulebase in its normal business would not be eligible either, since it probably would not have the on-board expertise to discriminate between tests and recognize if and when a vendor tries to "cheat" on them. And vendors do cheat.

On the subject of cheating, all vendors should be held to the same standards. For example, if one vendor is allowed to use compiled code (in any form beyond compiling the Java or C or C++ classes themselves), then the other vendors should be allowed to do the same thing. An example of this is compiling the rules into Java that is then run with a JIT compiler. OPSJ has ONLY this method of running its rules. With JRules, Blaze Advisor and others it is an option. Drools and Jess can be run in interpreted mode only and so have put themselves in an unfair position. Corticon and others are NOT based on the OPS format of most benchmark rules, so they sometimes have an advantage and sometimes a disadvantage. The company doing the benchmarks has to do everything possible to make sure that the playing field is level and, where it isn't, point that out to the readers. (Ergo, the reason for the pure independence of the benchmarking company.)
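To make the difference concrete, here is a rough sketch in plain Java - hypothetical names only, not any vendor's actual API - of a rule interpreted at runtime versus the same rule compiled into ordinary Java that the JVM's JIT can then optimize like hand-written code:

// Hypothetical illustration only; not the API of OPSJ, JRules, Jess or any other product.

// A fact type used by the example rule.
class Order {
    double total;
    boolean discounted;
}

// Interpreted style: the rule's condition and action are data the engine walks at runtime,
// so every evaluation goes through generic dispatch.
interface Condition { boolean test(Order o); }
interface Action    { void apply(Order o); }

class InterpretedRule {
    private final Condition condition;
    private final Action action;
    InterpretedRule(Condition condition, Action action) {
        this.condition = condition;
        this.action = action;
    }
    void fireIfMatched(Order o) {
        if (condition.test(o)) {
            action.apply(o);
        }
    }
}

// "Compiled" style: the same rule emitted as plain Java, which the JVM's JIT compiler
// can inline and optimize.
class CompiledDiscountRule {
    static void fireIfMatched(Order o) {
        if (o.total > 1000.0 && !o.discounted) {
            o.discounted = true;
            o.total *= 0.95;
        }
    }
}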

Also, the company producing the benchmarks should not only make the code and data readily available, but should also produce meaningful benchmarks; meaning that Miss Manners has about run its course and probably should be dropped from contention. It's nice to see the results, but the test itself is way too easy to cheat, whether in the code itself or "under the covers" with optimization aimed strictly at that benchmark. (Yes, some companies have been doing that for years and we, the poor schmucks in the industry, have just now caught on.)

OK, let's assume that my company (KBSC) ran the benchmarks. We're independent. We aren't partners with anyone, much to our financial loss. We do NOT produce any kind of variant of the rulebase in question. We showed our code. What else can we do to ensure the integrity of our results? Here are some variables that should be posted with the benchmark (one way to capture most of them is sketched just after this list):

Machine Name
Machine Model
Machine Speed
Machine RAM Total
Machine RAM available (was something else keeping the machine busy at the time?)
The command line used to run the tests
(the -Xmx, -Xms and -server flags mean something to the JVM)
Operating System and version
Java, C++ or C# version used
Number of "warm up" cycles
J2EE or EJB used
(if so, which one, which version, and the setup used)
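For what it's worth, here is a minimal sketch (my own hypothetical helper, not part of any benchmark kit) of how most of those details can be captured from inside the JVM and printed alongside the results:

// Hypothetical reporting helper - the names and output format are my own invention.
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;
import java.util.List;

public class BenchmarkEnvironmentReport {
    public static void main(String[] args) {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        Runtime rt = Runtime.getRuntime();

        System.out.println("OS:               " + System.getProperty("os.name")
                + " " + System.getProperty("os.version")
                + " (" + System.getProperty("os.arch") + ")");
        System.out.println("Java version:     " + System.getProperty("java.version")
                + " / " + System.getProperty("java.vm.name"));
        System.out.println("Processors:       " + rt.availableProcessors());
        System.out.println("JVM max heap MB:  " + rt.maxMemory() / (1024 * 1024));
        System.out.println("JVM free heap MB: " + rt.freeMemory() / (1024 * 1024));

        // The JVM input arguments show heap settings such as -Xms and -Xmx actually in
        // effect; the VM name above indicates client vs. server mode.
        List<String> jvmArgs = runtime.getInputArguments();
        System.out.println("JVM arguments:    " + jvmArgs);
    }
}

Machine name, model and physical RAM still have to be recorded by hand, but the rest can be printed automatically with every run.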

There are lots of variables in running a benchmark. Every one of them must be shown, because one vendor runs better on one OS and another vendor runs better on another OS. The same applies to the Java version used, J2EE, etc. Unfortunately, since I (personally) left the independent arena, lots of folks have jumped into the gap but few can step up to the plate and meet all of those requirements on all of the benchmarks that they produce. (OK, Steve Nunez is trying really hard, but Illation is, after all, not only a partner with ILOG but produces its own knowledge software as well.)

How do we fix the independent benchmarking problem? Well, we need an independent agency. And it must be like UL or ACORD (the de facto global e-commerce standards body for insurance), but designed for testing a rulebase and evaluating a BRMS; and we need it now! So, here's a thought: What IF we (the industry) did the following? We, the industry and customers, form an independent laboratory. Companies (vendors and/or customers) with 1,000 or more employees would contribute $50K annually toward such an organization. Companies with 100 to 999 employees would contribute $20K annually. Individuals would contribute $100 per year for membership. This would entitle them to the results of any and all tests being run, input to the company for fairness (but they would have to agree that the sole arbiter of any dispute would be the benchmarking company) and monthly reports on what's happening. Stockholders, if you will, but non-voting stockholders.

And the company itself would deal with anything to do with BRMS, including evaluation of products, benchmarks, and an annual ranking of products based on various criteria such as speed, integration with other products, technical support, initial costs, professional services effectiveness and their costs, and anything else that might impact the customer. Vendors, of course, would be expected to contribute on the same scale as the companies subscribing to the service. (OK, I took a page out of Forrester's book!!)

[Personal grousing time] The reason that I had to leave the pure independent route was that I could not survive financially and do pure research - and nobody in the BRMS market would step up to the plate and provide the wherewithal to be independent and still survive. I even tried (in cahoots with InfoWorld) to form a company that would do what I described above, but IW and the rest of the BRMS vendors couldn't see an advantage for them or for the readers. Well, maybe the readers of this blog can see what such an outfit could do to help the poor sod customers by showing them who's who and what's what BEFORE they try to do it themselves.

SDG
jco

6 comments:

snshor said...

Very good post. Could not agree with one thing though: if one vendor provides JIT and another does not, it is not an unfair advantage, as long as the mode is "transparent" - i.e. it does not require extra programming or a change in the logic of the application. Users, in general, do not care whether their application is "fairly" or "unfairly" fast.

James Owen said...

Regarding the snshor post on 4 Nov 07

1) Hmm... That was my point - if that is the ONLY way that the vendor can do the benchmarks, then it's up to the testing company to "level the playing field." Which is why I usually showed several different methods for each product. When attempting to compare Corticon to most other vendors, it was virtually impossible to do because the tests were written for Rete-OPS type engines, not for spreadsheet-type engines. We came close, but the bottom line is that they ARE different and I should not have allowed them to do Miss Manners the way that they did UNLESS I had allowed all of the others to change their rules as well. And Corticon STILL didn't like what I did.

2) Users are concerned only that their application is fast - fair or unfair. BUT, and this is the main point, the tests have to be fair for the user to make an intelligent selection based on the comparative performance of the various vendors.

snshor said...

Probably, I was not clear - I meant Rete-OPS compatible engines only. Sure, Corticon is different, RuleBurst is different, and some other products (including the one I work on - plug alert!) are different; even Rete-OPS engines are not 100% compatible in terms of syntax, semantics, order of execution, etc. But at least they can run (in theory) on "the same" ruleset. In this situation, a vendor that has JIT or other optimizations should not be handicapped because others did not bother to implement them.

The situation with Corticon and other DIFFERENT engines is more complicated. For some reason the industry has been able to convince the public that rule engine == Rete engine. This is a completely different topic, and it is up to non-Rete vendors to convince users that this is only partially true. Fortunately, the Rete vendors are doing it for us, rushing to include sequential modes, ruleflows, decision tables and trees in their products...

snshor said...

For such a variety of products there should be a better way of comparing performance and other features. For example, for each test problem a simple Java API is provided; for Manners:

interface MannersEngine {
    // Guest is assumed to be the benchmark's fact class
    Guest[] solveManners(Guest[] guests);
}

// each vendor supplies a factory that returns its own implementation
interface MannersEngineFactory {
    MannersEngine createEngine();
}



Every vendor supplies its solution, a testing robot runs the tests, and the robot produces the performance results. The advantage of this approach is that solutions may be automatically tested not only for performance, but also for scalability and thread-safety.
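A minimal sketch of what such a testing robot might look like, assuming the MannersEngine and Guest types above (the harness names here are invented):

// Hypothetical harness sketch; MannersEngine and Guest come from the proposed API above.
public class MannersTestRobot {

    // Times one vendor's solution after a few warm-up cycles so that JIT compilation
    // does not distort the measurement.
    public static long timeSolutionMillis(MannersEngine engine, Guest[] guests, int warmUps) {
        for (int i = 0; i < warmUps; i++) {
            engine.solveManners(guests);
        }
        long start = System.nanoTime();
        engine.solveManners(guests);
        return (System.nanoTime() - start) / 1_000_000;
    }

    // A fuller robot would also verify the returned seating, call the engine from several
    // threads to check thread-safety, and grow the guest count to measure scalability.
}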

Then the testing authority decides whether the solution was made using only standard product features (simple transformations do not count) and gives additional marks for "elegance" and "simplicity."

This way, very different products may still compete.

Anonymous said...

James,

First, thank you for nurturing these benchmarks for so long. I believe you've provided a real service to the community.

At ILOG we have decided to take the route of publishing our benchmarks and making everything publicly available so anyone can verify the results in their own environment.

For example:
http://blogs.ilog.com/brms/2007/10/22/academic-benchmark-performance/

And the ensuing discussion:
http://forums.ilog.com/index.php?topic=53.0

We strive for total transparency, and we hope that this approach will weed out the many environmental factors that can influence benchmark results. The bottom line is that people trust a benchmark they can run themselves, on their own machines.

I aim to blog about the optimization options for the ILOG rule engines, to elaborate on some of our tuning capabilities and how they can impact performance.

Sincerely,
Daniel Selman