Monday, June 16, 2008

Parallel Rulebase Systems and Homeland Security

I first wrote about this back in November 2006, in which I chatted aimlessly about Forgy and Gupta and others and what they had done. Since that time, I've been doing quite a bit of reading, all of which I will post somewhere that everyone can get the papers and read for themselves. Meanwhiile, let's take a look at what we "think" we can do before we do it.

First, most of the processing time goes into the match process - about 85% to 90% of the total time involved. Moving from there you can get about a 10 times improvement on the speed of that process alone. You can get improvements on the rest of the processes as well but they are dwarfed in significance by the match process. There is one company that has done quite a bit of work for IBM and others on parallelizing processes in C and C++, the Rapid Mind company, out of the frozen north (Canada) - but they, too, can only seem to get a 10 to 1 improvement.

Two things to consider here: (1) Unless Dr. Forgy can do something that nobody else can do, we won't get much better performance on small systems using parallel rulebased systems. (2) CLIPS has showed us how to get about 250 to 1 improvement on rulebase performance using proper indexing and proper writing of code. Gary Riley (the CLIPS guy) and Dr. Forgy will be at the October [Technical] Rules Fest in the DFW area in, well, October. See for more details on that one.

What I didn't discuss was what is the need for speed? Aren't most rulebased systems sufficiently fast enough as it is? The answer for the business applications, yes. But the answer for research, defense, homeland security, NO! The problem is that in the business world the match process is not terribly complicated BECAUSE the KE (Knowledge Engineer) who oversees the program won't allow it to happen BECAUSE the KE knows that the match process is the problem child of the rulebase world.

Unfortunately, you can't avoid the many objects, many patterns, many rules matching process in the "Big Boy" applications. For example, Homeland security (and I'm not telling things that are in any way a secret here) has between 500K and 2M rules, most of which are small LHS (Left Hand Side) with only a few CE's (Condition Elements) that have to be matched against thousands of ports of entry against millions of travelers. WOW! Even with TeraBytes of RAM you won't be able to process all of that in this century UNLESS you can parallelize most of it. Now this is where the match problem will really dominate.

Let's look at another problem: That one that you see on all of the crime shows so often, the DNA match process. Most of the time they are matching only a small number of DNA samples, usually less than a few thousand, and NOT matching on a national database sized sample. In the UK they match on their own database - one that is growing by leaps and bounds daily - but not on all of the EU. By the time they get a hit on a suspect, days or months later, the suspect has moved and left a cold trail behind or returned to his home country. (I use "his" rather than "her" because I've heard of very few female terrorist except for the poor, misguided Daughters of Islam who have been beguiled into becoming a human bomb.) In R&D or in psychology, the very large database of objects along with many, many rules (usually used in lieu of a neural net) would benefit from a parallel matching process as well.

OK - how can we get this financed? Banks, insurance companies, stock markets, none of them have that problem or they can code around the problem. R&D? No money. Government? Ah, there's the Honey Pot!! Now that we know where the Honey Pot is located, how can we get them (the government watch dogs) to open the lid for us? Simple - show them how it would work on a similar problem that they might have. So, on that thought, does anyone have a simple 2,000,000 rules that can be associated with 5,000 ports (5,000,000 if you include non-listed ports) along with 5,000,000 possible entries? Didn't think so. But Uncle Sam does. Now, if Uncle Sam would just let us play with this problem for a year or two, we could get Homeland Security down pat. But they won't. Not without so many bureaucratic layers that nothing will get done. So, I guess we'll just sit and wait for the mushroom clouds to show up on the horizon. Or next door. See you in ....


1 comment:

woolfel said...

From what I've seen, the top 5 financial institutions do have a need for this kind of extreme performance. Many financial firms still use batch processing due to performance limitations.

It's one of the reasons said tabet and I worked on distributed RETE. The "big guys" have gigantic datasets and heafty performance requirements.

In my bias opinion, partitioning the data and using distributed beta node indexes is the best solution to the problem. It allows for horizontal scaling with cheap commodity hardware and makes it so each engine doesn't have to load all facts into memory. Instead, we just replicate the indexes. In the really extreme cases, one could partition the indexes for dataset is in the petabyte range. Replicating the data across multiple rule engine instances isn't practical at all for the extreme situation.

Distributed RETE was inspired by how RDBMS handle distributed indexes and table partitions. Some data grid products like coherence are already exploring these concepts with distributed work managers.