Thursday, June 18, 2009

More on Benchmarks for 2009

Greetings:

Just another post to TRY and get some discussion started on benchmarks. All I seem to get from the vendors is, "Those old things? Those are not 'real-world' benchmarks!" OK, no argument here... So, does anyone have a "real-world" benchmark? No? Fine, then we will use what we have and hope for the best, since they seem to work fine for checking rule-processing performance (using almost "real-world" rules) on most engines.

Checking performance on a modeling tool, such as Corticon, Visual Rules, RuleBurst, VisiRule, or any decision table or decision tree, is absolutely NOT what checking rule processing power is all about. [I'll probably catch some flak on that remark, but those are just my personal feelings.] If you already have the rules hard-coded, using either straight-up Java or some kind of modeling tool that produces Java code (such as sequential rules or something else along those lines), then that SHOULD run faster than a real rulebase engine that is designed from the ground up to be an inferencing engine based on whatever algorithm you like.

However, I would like this year to allow a couple of things that I have not done in the past: (1) Allow a "warm-up" of maybe three or four passes through all of the rules, using different data each time, and then run the rules for 10 consecutive passes against 10 different sets of data, taking the average as the benchmark time. (A rough sketch of that warm-up-and-average harness appears after the list of engines below.) In years past, rules did not run under EJB/J2EE or similar environments (we had Java for several years before we had J2EE/EJB) and we did not allow such things. However, with the increased overhead of having that in the core part of the engine, I think that it should be allowed. (2) I'm going to drop the old versions of Miss Manners (8, 16, 32, 64, 128 and 256) and substitute Miss Manners 2009 - which is the ORF example for this year. (3) The other two benchmarks from the old days are still good: Waltz-50 and WaltzDB-16. (4) However, we are introducing a new WaltzDB-200 this year just to really get some long run times. (5) We will run these all on the following systems:

a1. Mac Core 2 Duo, 3GB RAM, OS X Leopard, 64-bit (which is FreeBSD Unix with a pretty face)
a2. Mac dual quad-core, 8GB RAM, OS X Snow Leopard, 64-bit [maybe...]
b. HP Intel, 3GB RAM, dual-threaded, Windows XP, 32-bit
c. Dell Intel i7, 4 cores / 8 threads, 6GB RAM, Windows Vista, 64-bit

I might try to work in some Linux if there seems to be any significant speed difference between Linux and Windows on the same Intel box - but experience teaches that usually the Windows version runs faster. I will check it anyway just to be sure. (6) The engines that I am hoping to check will be (in alphabetical order):

a. Blaze Advisor 6.7
b. Drools Version 5.x
c. CLIPS Version 6.30b
d. Jess 7.0-p2
e. JRules Version 6.7.3 or 7.x (depends...)
f. OPSJ Version 6.0
g. OPSJ-NT Version 1.0
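
To be clear about the protocol in item (1), here is a minimal Java sketch of the kind of harness I have in mind. The BenchmarkPass interface and the dummy workload are purely illustrative - they are not any vendor's API - and the real harness will wrap each engine's own load/assert/fire calls.

    // Minimal sketch of the warm-up-and-average protocol: a few untimed
    // warm-up passes, then ten timed passes over different data sets, with
    // the average wall-clock time reported as the benchmark number.
    public class BenchmarkHarness {

        // Hypothetical interface: one complete run of the rule set against
        // one data set (load the data, fire all rules). Not a vendor API.
        public interface BenchmarkPass {
            void run(int dataSetIndex);
        }

        public static double averageMillis(BenchmarkPass pass,
                                           int warmupPasses,
                                           int timedPasses) {
            // Warm-up passes: let the JIT compile the hot paths and let the
            // engine's internal caches settle. These passes are not timed.
            for (int i = 0; i < warmupPasses; i++) {
                pass.run(i % timedPasses);
            }
            // Timed passes: a different data set each time.
            long totalNanos = 0;
            for (int i = 0; i < timedPasses; i++) {
                long start = System.nanoTime();
                pass.run(i);
                totalNanos += System.nanoTime() - start;
            }
            return (totalNanos / (double) timedPasses) / 1000000.0;
        }

        public static void main(String[] args) {
            // Dummy workload so the sketch runs stand-alone.
            BenchmarkPass dummy = new BenchmarkPass() {
                public void run(int dataSetIndex) {
                    long x = 0;
                    for (int i = 0; i < 1000000; i++) {
                        x += i ^ dataSetIndex;
                    }
                    // Keep the result live so the loop is not optimised away.
                    if (x == 42) System.out.println("unlikely");
                }
            };
            System.out.println("average pass (ms): " + averageMillis(dummy, 4, 10));
        }
    }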

I will probably publish the results here, along with previous years' performance benchmarks, as well as on the KBSC home page. The 32-bit versus 64-bit comparisons should tell us something about scalability, and the comparisons across operating systems should tell us something about portability.

One more thing: If any of the other vendors can demonstrate a suitable version of the benchmarks, I will include them - but NOT a repeat of what I did a few years ago, when I allowed a "similar" version of the benchmarks to be used by a vendor that could not code straight-up IF-THEN-ELSE rules with a NOT statement in there somewhere.

I do expect cheating on the part of the vendors. Somehow, I must find a benchmark that will not allow that, so I'll probably throw in one that has lots of NOT statements in it, or something really rude like that. I know that the vendors don't really pay attention to benchmarks any more, so I'm hoping that the customers of these and other vendors will stress performance benchmarks to their suppliers as another check of good engineering. Layering GUI after GUI after model after model is cool EXCEPT when you forget how to perform under the pressure of millions of transactions per day that need complex, real rulebase-type analysis.

SDG
jco

7 comments:

Charles Young said...

Here are a few thoughts.

Regarding 'warm-up' time, three or four passes is nowhere near enough. Heap fragmentation is a very significant performance factor, even when the engine exploits a compacting garbage collector (remember that a garbage collector will try to minimise the number of collections it does). It can take hours of running before these effects stabilise fully, although after a couple of thousand invocations I generally find that the performance graph has just about flat-lined.

I favour doing benchmarking of this sort at two extremes - single hit, which demonstrates how poor the performance is when a rules engine is used just once within a process, and averaging across multiple invocations of the same rule set after the engine has been continuously executing that rule set for a few hours.
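
Something along these lines shows both numbers. This is only a rough sketch: the Runnable stands in for "invoke the engine once against the rule set" and is not any particular engine's API.

    // Rough sketch: report the single-hit cost, then keep iterating and
    // print a rolling average so you can watch the curve flatten (or decay).
    public class LongRunProbe {

        public static void probe(Runnable oneInvocation,
                                 int totalIterations,
                                 int window) {
            System.out.println("single hit (ms): " + timeNanos(oneInvocation) / 1000000.0);
            long windowTotal = 0;
            for (int i = 1; i <= totalIterations; i++) {
                windowTotal += timeNanos(oneInvocation);
                if (i % window == 0) {
                    double avgMs = (windowTotal / (double) window) / 1000000.0;
                    System.out.println("iterations " + (i - window + 1) + "-" + i
                            + ": avg " + avgMs + " ms");
                    windowTotal = 0;
                }
            }
        }

        private static long timeNanos(Runnable r) {
            long start = System.nanoTime();
            r.run();
            return System.nanoTime() - start;
        }

        public static void main(String[] args) {
            // Dummy workload in place of a real rule engine invocation.
            Runnable dummy = new Runnable() {
                public void run() {
                    long x = 0;
                    for (int i = 0; i < 500000; i++) x += i;
                    if (x < 0) System.out.println("never");
                }
            };
            probe(dummy, 2000, 200);
        }
    }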

I would hope that you would use many different rule sets, and not just Waltz. The historic focus on just one or two is a major problem. Each rete produced by the engines you list has its own specific performance characteristics, and these differ widely for different rule sets. Each rete has different big-O asymptotic time and space complexity and exercises the features of each engine in different ways. Building a more realistic picture of true performance comparison between different engines requires using lots of different benchmarks. Remember that the 'Rete wall', beloved by at least two different vendors, is really an algorithmic feature of one specific benchmark (Miss Manners). Different rule sets will render different characteristic curves.

Why do we never measure space complexity (memory usage)? Why are we only ever concerned with time? I realise memory measurements can be difficult and misleading, especially when automatic garbage collection is used.
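
For what it's worth, the crude approach in Java is something along these lines - with the usual caveat that System.gc() is only a hint, so the figure is at best an approximation of retained working memory.

    // Very rough approximation of the heap retained by a loaded rule session.
    // System.gc() is only a hint, so treat the result as indicative, not exact.
    public final class HeapProbe {

        public static long usedHeapBytes() {
            Runtime rt = Runtime.getRuntime();
            // Ask for a few collections so floating garbage is (mostly) gone.
            for (int i = 0; i < 3; i++) {
                rt.gc();
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            return rt.totalMemory() - rt.freeMemory();
        }

        public static void main(String[] args) {
            long before = usedHeapBytes();
            // ... load the rule set and assert the working memory here ...
            long after = usedHeapBytes();
            System.out.println("approx. retained: " + (after - before) / 1024 + " KB");
        }
    }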

Who checks the implementations of the same rule set on different engines to ensure they provide a fair comparison? Miss Manners is the locus of the more extreme examples of how vendors can cheat by performing totally bogus comparisons. Remember that awful canned video Haley had on their web site, where they compared an efficient rule set running on their engine against Miss Manners on a competitor's engine. Miss Manners was deliberately written to be as inefficient as possible. Haley failed to point that out, and used their results to suggest that their engine could calculate the same result (seating arrangements of guests) thousands of times faster than their competitor's engine - which of course it can, if you compare entirely different rule sets on the two engines, one of them deliberately designed to be highly inefficient while the other is efficient.

In reality, ensuring fair comparison for benchmark rule sets is very complicated. Performance can turn on the smallest differences in implementation, and can be deeply affected by specific side effects of different engines. Each engine exhibits different side effects, and so a specific implementation that suffers or benefits from some specific side effect doesn't, by itself, tell you much about real-world comparative performance. Other rule sets may benefit or suffer in different ways from different side effects on the same or different engines. The only way forward is to build up a general picture by performing lots of different benchmark tests in a fair way.

James Owen said...

Charles:

I did a "warm-up" comparison for JRules quite some time ago - that is when it became evident that we needed such a thing. Even OPSJ performs better after one or two warm-up cycles. When running benchmarks like ours, "my experience" (don't you just love that phrase?) has been that performance stops improving somewhere between four and seven warm-up passes, depending on the engine, parameters, etc. Fine-tuning a rulebase is not an easy task.

Perhaps I will have time for the Single Hit and then the Ten Warmup pass as well. Remember, these are NOT what the vendors call "real world" benchmarks - but they have not provided any "real world" benchmarks except for running through a spreadsheet (decision table) application that is definitely not a test of the engine itself.

I'm currently planning on running four benchmarks: Manners 2009, Waltz-50, WaltzDB-16 and WaltzDB-200. The first has not been coded yet. The last one has to be written for each vendor, and I'm not getting any help at this time. The "Rete Wall" can be avoided fairly easily by using more memory, BUT then you run up against the infamous Garbage Collector in Java. CLIPS doesn't have this problem, so it will probably compare very favorably on the problems with large memory requirements.

Miss Manners was originally designed to test the Agenda Table and the memory handling capacity of the engine. It was, indeed, poorly written IF you were trying to optimize the rules. However, as written originally, it was a perfect test. Unfortunately, vendors learned how to defeat the benchmark by coding some "cheats under the covers" which we could not prove. Drools and Jess were the only ones that we could double check for short cuts.

Measuring memory usage is a problem in Java. If you have any suggestions on Java or C/C++ I would be thrilled to hear them and would probably give them a whirl. And, yes, I well remember the Haley marketing pitch, which is why they never would submit their engine for evaluation.

I try as much as possible to get the vendors to participate in this annual event by helping to "fine-tune" the benchmarks on their own engines, but only one or two ever respond. OK, three or, sometimes, four.

AND, the best benchmark of all is a fairly large data set of your own application using your own rules and then checking various vendors to see which one performs the best over a long period of time, like several days.

Remember, performance is NOT the only criterion. Sequential rules usually (always?) perform better than an inference engine. But you lose the flexibility of an inference engine. And adding rules to a sequential rulebase can be a nightmare.

SDG
jco

Geoffrey De Smet said...

I think that the performance of the Miss Manners 2009 benchmark is less than 1% about the rule engine performance and 99% about the optimization algorithm.
If you don't use the exact same optimization algorithm in every implementation, the results are very contestable (IMHO). If you do (and this can be hard due to syntax improvements, such as exists, that some engines allow and others do not), the results will be very interesting.

Besides, it's still not proven that a simple (brute force?) optimization algorithm in a rule engine implementation can actually find the solution in our lifetimes, let alone in a few minutes.

My drools-solver solves Miss Manners 2009 in a minute by combining tabu search (a non-brute-force optimization algorithm) with rule engine evaluation. So yes, if the objective is to benchmark the core rule engine, using drools-solver (or other heavy optimization algorithms) is cheating: its results should be separated from the pure rule engine results (from, for example, Drools without drools-solver).

Charles Young said...

The 'Rete Wall', as described and illustrated by FICO and Corticon, is really just the characteristic curve for Miss Manners on a typical Rete engine and is defined by the amount of work the engine is doing as the number of guests increases. It turns out that most of the work is done at just one location in the rete - specifically the last few nodes that feed into the Find_Seating p-node. For typical guest numbers, the curve isn't strongly related to memory per se, and throwing more memory at it won't necessarily change the shape of the curve, except at extremes. The curve simply reflects the O(n^x) complexity of the Miss Manners benchmark on a typical Rete engine and is dominated asymptotically by the number of tokens passing through the last few join and notCE nodes in the find-seating chain (most engines produce a couple of notCE nodes at the end of a chain of beta nodes). This, in turn, is a function of the number of seating, path and chosen facts asserted during forward-chaining. The first of the two notCE nodes (the one that tests seating.id and guest.name against each path fact) nominally performs a little over 200 million evaluations for Manners 128 (as tested on Jess 6 and Drools 3 a few years back). By 'nominal', I mean without consideration of the use of indexes. This gives a rough indication of how intensively this one part of the rete is working. Memory usage is fairly intensive, but any modern computer can comfortably cope with the load, and the combination of depth-first processing and memory indexing means that memory retrieval work is minimised. Intensive lookup against working memories was, I guess, a much bigger problem on older rules engines.

The agenda is an important consideration, of course, but looking at the whole picture, it tends to account for only a fairly small amount of the total time in Miss Manners. Most modern engines have fast agendas. Some engines even manage to avoid the use of priority queues and the cost of fixing activations up and down heaps. In any case, heapsort is only O(n log n), so it becomes increasingly less important as the number of guests increases.
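
To make that concrete, a conflict-resolution agenda is, in its simplest form, little more than a salience-ordered priority queue, and each poll() is the O(log n) heap maintenance I am referring to. The toy sketch below is only illustrative - Activation is a stand-in class, not any engine's type, and real agendas add refraction, modules and so on.

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Toy agenda: activations ordered by salience (higher fires first),
    // with recency as a tie-breaker (most recent first).
    public class ToyAgenda {

        public static class Activation {
            final String ruleName;
            final int salience;
            final long recency;
            Activation(String ruleName, int salience, long recency) {
                this.ruleName = ruleName;
                this.salience = salience;
                this.recency = recency;
            }
        }

        private long counter = 0;
        private final PriorityQueue<Activation> queue = new PriorityQueue<Activation>(
                11,
                new Comparator<Activation>() {
                    public int compare(Activation a, Activation b) {
                        if (a.salience != b.salience) return b.salience - a.salience;
                        return (b.recency < a.recency) ? -1 : (b.recency > a.recency ? 1 : 0);
                    }
                });

        public void add(String ruleName, int salience) {
            queue.add(new Activation(ruleName, salience, counter++));
        }

        // Each poll() is O(log n) heap maintenance.
        public Activation next() {
            return queue.poll();
        }

        public static void main(String[] args) {
            ToyAgenda agenda = new ToyAgenda();
            agenda.add("low-salience rule", 0);
            agenda.add("high-salience rule", 10);
            System.out.println("fires first: " + agenda.next().ruleName);
        }
    }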

I agree about memory measurement in Java, and don’t have a good answer. The garbage collector masks the underlying space complexity. Leaving lots of junk in memory for as long as the garbage collector reasonably can is generally a good strategy, but doesn't, of course, help minimise the heap fragmentation problem.

I did a whole lot of comparative testing some time ago between a sequential engine and a Rete engine from the same vendor. To my surprise, the Rete engine easily outperformed the sequential engine and was far more scalable. The vendor believes that the next version of their sequential engine, which won't be launched for a little while yet, will significantly improve performance. We shall wait and see.

Geoffrey makes a really good point about the difficulty of doing comparative benchmarking. You have to work really hard at understanding what it is you are truly comparing. That requires an in-depth analysis of different implementations, and often a deep understanding of the specifics of each engine. In my dreams, what I would really like to see is a set of micro-benchmarks that focus on individual capabilities of engines - e.g., a rule set that measures join performance for specific types of evaluation. I've seen the situation where an engine offers more than one way of doing simple equality comparisons, but which performs differently depending on the chosen equality operator - the engine is either faster or slower than another engine depending purely on which operator the developer decides to use. Another example test might be a rule set that measures how well an engine scales when evaluating a large number of rules, each one of which performs a simple equality evaluation on the same fact attribute (i.e., measuring the 'hashed condition literals' approach that several engines employ).
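
As a trivial illustration of that last suggestion, the rule text for such a micro-benchmark can simply be generated. The sketch below emits DRL-flavoured rule text; the Customer fact class and the syntax are illustrative only, and each engine under test would need its own dialect of the same rule set.

    // Generates N rules that each test the same attribute against a different
    // literal value - the pattern that 'hashed condition literal' optimisations
    // are meant to accelerate.
    public class LiteralRuleGenerator {

        public static String generate(int ruleCount) {
            StringBuilder sb = new StringBuilder();
            sb.append("package benchmark\n\n");
            sb.append("import benchmark.Customer\n\n");   // hypothetical fact class
            for (int i = 0; i < ruleCount; i++) {
                sb.append("rule \"category-").append(i).append("\"\n");
                sb.append("when\n");
                sb.append("    Customer( category == ").append(i).append(" )\n");
                sb.append("then\n");
                sb.append("    // count the match\n");
                sb.append("end\n\n");
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // Print a small sample rule set.
            System.out.println(generate(5));
        }
    }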

Charles Forgy said...

I agree that it is essential that all engines be running the same rules. We all owe James Owen a big debt of gratitude for his efforts over the last several years to keep benchmarking fair.

Regarding warm up time, I think the main reason for doing this is the Java runtime. The runtime starts out interpreting everything, and compiles methods only as it determines that they are used heavily. (And modern runtimes will even recompile some parts of the program if they determine the first compilation was suboptimal.) For the first part of the run of any Java program, times include both interpretation overhead and compilation overhead. It takes some time for this to settle down, especially if the "-server" option is being used.
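
One easy way to watch that settling-down happen is to print the JVM's cumulative JIT compilation time between benchmark passes; when the number stops climbing, the compiler has largely finished with the hot paths. A small sketch:

    import java.lang.management.CompilationMXBean;
    import java.lang.management.ManagementFactory;

    // Prints how much time the JIT compiler has spent so far. Calling this
    // between benchmark passes shows compilation activity tailing off as the
    // hot methods get compiled (run with -server to see the effect clearly).
    public class JitWatcher {

        public static void report(String label) {
            CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
            if (jit != null && jit.isCompilationTimeMonitoringSupported()) {
                System.out.println(label + ": total JIT time so far = "
                        + jit.getTotalCompilationTime() + " ms");
            }
        }

        public static void main(String[] args) {
            report("startup");
            // ... run a benchmark pass here, then call report() again ...
        }
    }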

I do not agree at all with those who say that benchmarks such as Waltzdb are useless because they do constraint satisfaction instead of ... (insert your favorite task). What particular task a benchmark program does is unimportant; the only important issue is how the benchmark stresses the rule engine. There are several ways to stress a rule engine: by having a large number of rules, by having a large working memory, by requiring a large amount of combinatorial search in the match phase, by having a high rate of change of working memory, etc. Waltz and Waltzdb do pretty well on the amount of combinatorial search required. It would be nice to have benchmarks that stress other factors.

Charles Young said...

One of the impressive performance characteristics of CLIPS 6.3 compared to Java and .NET engines is that it doesn't appear to suffer significantly from the 'first-hit' problem. With Java and .NET engines, there is a significant 'first hit' cost which takes a few iterations to disappear. Charles has identified what must surely be the major cause for this.

The issue I was highlighting is that if you make measurements after just a few iterations, you miss a more subtle, but ultimately very significant performance characteristic. The first few hits in CLIPS are about as good as it gets. If you keep on iterating a test hundreds or thousands of times, the performance will decay. Fortunately, the rate of decay will decrease until the resulting curve flattens out to a constant time. It can take thousands of iterations to get to a completely flat line, though you do get a fairly good idea of where things are heading after just a few hundred iterations. Last time I looked at CLIPS 6.3, I was seeing something approaching 20% decay in performance.

CLIPS is by no means the only engine to exhibit this behaviour (I'm not having a go at your engine, Gary - honest). It is a common characteristic that I've seen for every engine I've ever tested in this way. My understanding is that it is principally related to heap fragmentation (glad to be corrected there if I am wrong). It's an important consideration because in many scenarios, we do indeed invoke the same rule set again and again within the same process. Certainly in my line of business this is the norm. Long-term performance characteristics are far more relevant to what I do than first hit behaviour. Different engines suffer from these longer-term performance issues to different extents, so performance differences seen after just a few iterations can be quite different to what you see after thousands of iterations. This is why most professional load / stress testing only samples performance after a system has been running continuously for several hours.

Can I raise another small point? I admit up-front to being highly pedantic here. My question is what actually constitutes a 'stress' test. Wikipedia suggests that it is all about pushing a system towards breaking point, but I don't think that is correct. I always thought of that as a form of 'load' testing. Surely 'stress' is what happens when a system is pulled in different directions at the same time. Stress-testing a web site, for example, means seeing how it behaves when asked to simultaneously cope with many different tasks (especially under load, so clearly stress testing is often a form of load testing as well). From that perspective, running a single test like Waltz DB or Manners on a single thread with varying volumes of data input doesn't truly constitute a 'stress' test. Just my (pedantic) opinion, of course.

Daniel Selman said...

I've written fairly extensively about JRules performance here:
http://blogs.ilog.com/brms/category/jrules/jrulesperformance/

However, the topic is huge and I never seem to have enough time to fully do it justice. As you can see, "performance" spans many dimensions.

I still think one of the best articles on benchmarks is by Jon Stokes:
http://arstechnica.com/old/content/1999/04/benchmarking.ars

He neatly summarizes the technical and political challenges for users and vendors alike.

Sincerely,
Dan