Saturday, July 26, 2008

How Large is "Large" ??

Greetings, Programs:

WARNING: This is a (long) rambling, philosophical blog post - but not a "rant" as in the past. But if you are new to rules (or even if you have a couple of years of experience) you might find it interesting.

OK, so what do we mean when we use the terms "large" or "complex" when talking about a rulebase? First, let's consider the term "large" as compared to "medium" or "small." In the 80's the most advanced scientists and rule engineers wrote one of the most sophisticated pieces of medical diagnosis software (MYCIN) using a rulebase, and it had 587 (or so) rules. Another, written in the late 70's and used through the 80's for configuring large computer systems (R1, later known as XCON), had just over 1,000 rules. Today, most rulebase geeks, or most business analysts for that matter, think of "large" in terms of "tens-of-thousands" of rules.

But what does this mean exactly? 10K rules? 20K rules? 50K rules? A couple of years ago I asked the two major vendors what was the largest rulebased system that they had ever designed and successfully implemented. Both said that it was 20K+ rules and that most "enterprise" BRMS consisted of 10K rules or less. Today, even the "little" guys are implementing 10K+ rules by using a spreadsheet approach, called a Decision Table by the vendors, wherein each line in the spreadsheet is a rule. If the spreadsheet has 10 columns that can be true or false, then there are 2^10 decisions to be made, or 1,024 rules. If you add one more column, then there would be 2,048 rows or rules. (By now you should have guessed that you need to keep the spreadsheets small in terms of the number of decision factors or columns.) The number of action columns in one of these spreadsheets is immaterial since the action for any one rule can be any number of things - we don't really care about that part right now.
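The doubling above is easy to see in a few lines of code. This is just a sketch of the arithmetic, not any vendor's Decision Table implementation:

```python
# Sketch: how a decision table grows with the number of true/false
# condition columns. Each added column doubles the number of rows
# (rules) needed to cover every combination.
from itertools import product

def full_table_rows(num_condition_columns: int) -> int:
    """Rows needed to cover every true/false combination of the columns."""
    return 2 ** num_condition_columns

for cols in (10, 11, 19):
    print(cols, "columns ->", full_table_rows(cols), "rows")

# Enumerating the rows themselves for a small 3-column table:
rows = list(product([True, False], repeat=3))
print(len(rows))  # 8 rows for 3 condition columns
```

Run it with 19 columns (the knee-pain spreadsheet mentioned below) and you get over half a million rows, which is why the column count has to be kept small.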

Just by comparison, the Homeland Security Department has 500K rules now (running on an IBM mainframe) and it will probably grow to about 2M rules in the next few years. Most are VERY simple rules, but the problem is the thousands and thousands of objects. If one rule matches 10K objects and another rule matches 20K objects, then we have 30K matches in those two rules alone.

A few years ago I had the privilege of dealing with one of the larger insurance companies in the UK, which wanted to go into the health insurance business in a big way - it was already in the health insurance business but was only the tenth largest or so (in the UK) in that part of its business. We looked at the spreadsheets they were already writing and determined that they would have about one spreadsheet for each major complaint: one dealing with just back pain, one with just knee pain, one with just shoulder pain, all of which came under structural or skeletal problems. The knee pain spreadsheet had 512 rows and 19 CE columns, and that compression came about ONLY because the vendor had just introduced an N/A or "Any Answer OK" cell for the spreadsheet. Without that, there would have been thousands and thousands of rows to consider.
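The "Any Answer OK" cell is what makes that 512-row table possible at all. A sketch of the arithmetic, with hypothetical numbers chosen to mirror the knee-pain example: a row that pins only some of the condition columns stands in for every expanded row it covers.

```python
# Sketch (hypothetical example): how "N/A" / "Any Answer OK" cells
# compress a decision table. A row that pins only k of n condition
# columns stands in for 2^(n-k) fully-expanded rows.

def rows_covered(row: dict, num_columns: int) -> int:
    """row maps column index -> required value; unlisted columns are 'Any'."""
    dont_cares = num_columns - len(row)
    return 2 ** dont_cares

# 19 condition columns, but this (made-up) row pins only 10 of them:
row = {0: True, 1: False, 2: True, 3: True, 4: False,
       5: True, 6: False, 7: True, 8: True, 9: False}
print(rows_covered(row, 19))  # this one row covers 512 expanded rows
```

So each "Any" cell in a row halves the number of fully-expanded rows you would otherwise have to write out by hand.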

So, was this a "big" system? In terms of the size of the underlying rulebase, I would have to say yes, because it still was not finished when I left after six months and had grown to more than 50 spreadsheets of this type, some larger, some smaller. The underwriters were still laboring to get their heads around the things that they were saying and put them down on paper.

So, are the rules complex? Meaning, does the CE (condition element) part of each rule, or of most of the rules, contain multiple forward-chaining propositional logic constraints? What is the average number of CEs in each rule? To answer this question I would suggest that we use the Conflict Resolution technique discussed in the Cooper and Wogrin book, "Rule-Based Programming with OPS5". The book is long since out of print but can still be purchased on-line at many locations in new or almost-new condition. For example, let's look at the following rule (done in modern English, not in the OPS5 syntax - see page 53 in the book for the original source code). We weight a rule according to the following principles that they give for specificity:

Element Class Name = 1
Predicate Operator With a Constant = 1
Predicate Operator With a Bound Variable = 1
Disjunction = 1
Each Predicate or Disjunction WITHIN a Conjunction = 1

So, using that logic for the following rule

IF there is any Student called student (1)
   student.placed-in == null (1)
   sex-is = student.sex
   smoker = student.smoke
AND there is any Room called room (1)
   number = room.number
   capacity = room.max
   room.vacancies > 0 and < room.max (2)
   room.sex == sex-is (1)
   room.smokers == smoker (1)

CE1a Element Class Name of Student (1)
CE1b Predicate With a Constant, placed-in being null (1)
CE2a Element Class Name of Room (1)
CE2d Each Test in Conjunction, vacancies > 0 and < max (2)
CE2e Predicate checking for sex (1)
CE2f Predicate checking for smoker (1)
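To make the rule above concrete, here is a plain-Python sketch of what it matches - an unplaced student paired with a compatible room that has a vacancy. The class and field names follow the rule text; this is illustrative, not the book's source code, and the sample data is made up.

```python
# Sketch of the student/room rule above in plain Python rather than OPS5.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Student:
    name: str
    sex: str
    smoker: bool
    placed_in: Optional[int] = None

@dataclass
class Room:
    number: int
    max: int
    vacancies: int
    sex: str
    smokers: bool

def match(students, rooms):
    for student in students:
        if student.placed_in is not None:          # CE1b: placed-in == null
            continue
        for room in rooms:
            if (0 < room.vacancies < room.max      # CE2d: two tests in a conjunction
                    and room.sex == student.sex    # CE2e: sex must match
                    and room.smokers == student.smoker):  # CE2f: smoker must match
                yield student, room

students = [Student("ann", "F", False), Student("bob", "M", True, placed_in=101)]
rooms = [Room(201, 2, 1, "F", False), Room(202, 2, 2, "M", True)]
for student, room in match(students, rooms):
    print(student.name, "->", room.number)  # ann -> 201; bob is already placed
```

A real engine would of course do this incrementally via a Rete network rather than with nested loops; the loops are just to show which tests the CEs encode.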

Accordingly, we can see that the specificity, which should be closely correlated with complexity, is 7 for this rule. [I have put the weighting for each line as a number in parentheses.] Now, if we "weigh" each rule this way and divide by the number of rules, we should be able to arrive at the overall complexity of any rule set or rulebase.

There was a time when vendors not only used MEA (means-ends analysis) or LEX for Conflict Resolution but also published how they did it. You won't find that in today's manuals. It seems that they now basically use refraction, priority (specificity) and the order in which the rules were entered. BTW, the Cooper and Wogrin book covers the MEA and LEX examples in plain detail. If only the vendors would put doing what is right ahead of the convenience of faster development and the fear of run-time numbers. Remember, with CLIPS, Drools and Jess you can still write your own Conflict Resolution - but that is another subject that I think I have covered before. Would that you could get the "Big Boys" to actually do Conflict Resolution correctly. :-)
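The weighting scheme itself is mechanical enough to automate. A minimal sketch, using the weight table given above and the per-CE counts transcribed from the student/room breakdown (the feature names are mine, not C&W's):

```python
# Sketch: totalling Cooper & Wogrin-style specificity for a rule,
# using the weights listed in the text above.
WEIGHTS = {
    "element_class_name": 1,
    "predicate_with_constant": 1,
    "predicate_with_bound_variable": 1,
    "disjunction": 1,
    "test_within_conjunction": 1,  # each test inside a conjunction counts
}

def specificity(features):
    """features: list of (feature_kind, count) pairs for one rule."""
    return sum(WEIGHTS[kind] * count for kind, count in features)

# Counts transcribed from the CE breakdown of the student/room rule:
student_room_rule = [
    ("element_class_name", 2),       # CE1a, CE2a: Student and Room
    ("predicate_with_constant", 3),  # CE1b, CE2e, CE2f
    ("test_within_conjunction", 2),  # CE2d: vacancies > 0 and < max
]
print(specificity(student_room_rule))  # 7
```

Averaging this score over every rule in a rulebase gives the per-rule complexity figure suggested above.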

Well, if you enjoyed this, great. If not, well, it wasn't a total waste - at least you learned about another book to read. :-)

(Corrected some typos on 29 July 08)


woolfel said...

I find that rule count isn't a good measure of complexity, and hard-coded rules aren't maintainable. From what I've seen of pre-trade compliance, many systems hard-code the literal values in the rule. This means a firm with 50K customers would easily require 500K rules if each customer has 10 rules. A large firm may have tens of millions of customers, which could result in hundreds of millions of rules. Hard-coded, the number of rules becomes unmaintainable.

When I worked on pre-trade compliance, I took 50K+ rules and reduced them to less than 1K. That's a 50x decrease in rule count and makes it more practical to maintain.
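The kind of reduction described here usually comes from moving the per-customer literals out of the rules and into data. A hypothetical sketch (the customer IDs, limits, and rule shown are made up, not from any real compliance system):

```python
# Sketch: one parameterized rule plus a data lookup, instead of one
# hard-coded rule per customer.
#
# Hard-coded style:    10 rules per customer x 50K customers = 500K rules.
# Parameterized style: a handful of rules + 50K data rows.
LIMITS = {
    "cust-001": {"max_order_value": 10_000},
    "cust-002": {"max_order_value": 250_000},
}

def order_allowed(customer_id: str, order_value: float) -> bool:
    """One rule covering every customer via a data lookup."""
    return order_value <= LIMITS[customer_id]["max_order_value"]

print(order_allowed("cust-001", 5_000))    # within limit
print(order_allowed("cust-002", 300_000))  # over limit
```

The rule count then stays flat as customers are added; only the data table grows.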

James Owen said...

Peter -

Yeah, that's the "art" of being a good KE; the ability to combine (factor or re-factor) massive numbers of rules down to a manageable size. That's what we did with the spreadsheets at the insurance company, otherwise it would have been totally unmanageable. Unfortunately, with HLS that would be a monumental task since they have 500K rules already and it's growing every day.

As to complexity, I still like the way that Cooper and Wogrin (basically just OPS5) calculate the complexity of each rule. But this is something that you can't do easily on some systems, such as Visual Rules, since their reports come as "nodes", where an atomic rule is a node. Unfortunately, you can't calculate the complexity of a node in the C&W method.

Thanks for your comments - always nice to hear from a fellow KE. :-)