Intro

The idea of this post is to describe a pattern that improves the performance of a complex information system with very high CPU consumption and long processing times. This pattern helps to:

  • Maximize system capacity, making the system less prone to denial of service (DoS)
  • Reduce the number of long-running requests
  • Keep the system more stable

Normal circuit breaker use case

In the following diagram, the system is configured with a circuit breaker implemented as a WebLogic WorkManager that allows 20 concurrent threads to serve requests in parallel on each server. In addition, a Coherence data grid caches the information for as long as the underlying data remains unchanged.

 

smartcircuitbraker1.png

Let’s suppose we have at 12:00:00 an instant load like this:

  • 8 users requesting the operation /ss/cfa/1230 from Madrid
  • 9 users requesting the operation /ss/cfa/1230 from Barcelona
  • 12 users requesting the operation /ss/cfa/1234 from San Francisco

Let’s also suppose that the load balancer sends the requests coming from San Francisco (SFO) to server1 and the requests coming from Madrid (MAD) and Barcelona (BCN) to server3. Therefore:

  • server1 = 12 concurrent requests < 20 maximum allowed by the WorkManager
  • server3 = 8 + 9 = 17 concurrent requests < 20 maximum allowed by the WorkManager

In addition, let’s suppose the information for request /ss/cfa/1230 has just been invalidated in the cache because some of the underlying data changed at 11:59:59.

Let’s assume the following processing times:

  • Database responds with a linear function of 4000ms x number of concurrent requests (*)
  • Coherence cache responds in constant time for this range of load = 300ms
  • Application responds with a linear function of 200ms x number of concurrent requests (*)
  • Network time is constant = 500ms

Response Time Results:

Formula: TOTAL= network + application + backend(database or gridcache)

  • SFO = 500 + 12 x 200 + 300 = 500 + 2400 + 300 = 3.2 s (cached requests)
  • MAD = 500 + 8 x 200 + 8 x 4000 = 500 + 1600 + 32000 = 34.1 s (uncached requests)
  • BCN = 500 + 9 x 200 + 9 x 4000 = 500 + 1800 + 36000 = 38.3 s (uncached requests)
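
The arithmetic above can be reproduced with a small sketch. The constant and function names are illustrative and not part of the original system; the values are simply the assumed processing times listed earlier.

```python
# Assumed processing times from the scenario above, in milliseconds.
NETWORK_MS = 500             # constant network time
APP_MS_PER_REQUEST = 200     # application time grows linearly with concurrency
DB_MS_PER_REQUEST = 4000     # database time grows linearly with concurrency
CACHE_MS = 300               # Coherence cache time, constant for this load


def response_time_ms(concurrent_requests: int, cached: bool) -> int:
    """TOTAL = network + application + backend (database or grid cache)."""
    application = APP_MS_PER_REQUEST * concurrent_requests
    backend = CACHE_MS if cached else DB_MS_PER_REQUEST * concurrent_requests
    return NETWORK_MS + application + backend


if __name__ == "__main__":
    for city, n, cached in [("SFO", 12, True), ("MAD", 8, False), ("BCN", 9, False)]:
        print(city, response_time_ms(n, cached) / 1000, "s")
```

Changing the number of concurrent requests or the cached flag shows how quickly the uncached case dominates the total under this linear model.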

As we can see, uncached information produces response times roughly 10 times longer than those of cached requests, simply because this is a processing-heavy system.

The smart circuit breaker approach

Let’s suppose we can manage requests on each server in such a way that identical requests are processed only once. Meanwhile, the other identical concurrent requests are kept waiting until the result of the first request is available.

In this situation, server3 processes only one of the 17 uncached concurrent requests that arrived at 12:00:00; the remaining 16 are kept sleeping until that single request finishes and stores its result in the cache. See the following diagram.

 

smartcircuitbraker2.png
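
As a rough illustration of the idea, here is a minimal Python sketch of coalescing identical requests. It is not the WebLogic/Coherence implementation: the class name, the `loader` callback, the in-memory dictionary standing in for the data grid, and the `run_demo` helper are all assumptions made for the example, and cache invalidation is left out. Waiting requests here block on an event; how a real server parks them is outside the scope of the sketch.

```python
import threading
import time


class SmartCircuitBreaker:
    """Processes each distinct request once; identical concurrent requests wait."""

    def __init__(self, loader):
        self._loader = loader        # hypothetical expensive backend call
        self._lock = threading.Lock()
        self._cache = {}             # key -> result; stands in for the data grid
        self._in_flight = {}         # key -> Event set when the leader finishes

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]          # cached: no backend call
            event = self._in_flight.get(key)
            is_leader = event is None
            if is_leader:                        # first request for this key
                event = threading.Event()
                self._in_flight[key] = event
        if not is_leader:
            event.wait()                         # sleep until the leader is done
            return self.get(key)                 # cache hit, or retry on failure
        try:
            result = self._loader(key)           # the single uncached execution
            with self._lock:
                self._cache[key] = result
            return result
        finally:
            with self._lock:
                del self._in_flight[key]
            event.set()                          # wake up the waiting requests


def run_demo(n_requests=17, key="/ss/cfa/1230"):
    """Fires n identical concurrent requests; returns (backend calls, results)."""
    calls = []

    def slow_loader(k):
        calls.append(k)                          # record each backend execution
        time.sleep(0.05)                         # simulate heavy processing
        return "data for " + k

    breaker = SmartCircuitBreaker(slow_loader)
    results = []

    def worker():
        results.append(breaker.get(key))

    threads = [threading.Thread(target=worker) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results
```

Calling `run_demo()` sends 17 identical requests through the breaker; the backend loader runs once and every caller receives the same result. If the leader fails, the waiting requests retry and one of them becomes the new leader.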

So the time for the unique request is the result of the formula when only 1 uncached request is in progress:

server3.processing.time.for.the.unique.request = 500 + 1 x 200 + 1 x 4000 = 4.7 s

When the unique request finishes, the remaining 16 can be processed, but now they get the result from the cache. Their response time is therefore the waiting time plus the time a cached request takes:

RT = waittime + cached.data.processing.time

server3.processing.time.for.the.rest.16.requests = 4.7 s + (500 + 16 x 200 + 300) = 8.7 s
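
The two figures for server3 can be checked with a short sketch under the same assumed processing times. The function name is illustrative only.

```python
# Assumed processing times from the scenario, in milliseconds.
NETWORK_MS = 500
APP_MS_PER_REQUEST = 200
DB_MS_PER_REQUEST = 4000
CACHE_MS = 300


def smart_breaker_times_ms(identical_requests: int) -> tuple:
    """Returns (time of the unique request, time of each waiting request)."""
    # The unique request runs alone and uncached.
    unique = NETWORK_MS + APP_MS_PER_REQUEST * 1 + DB_MS_PER_REQUEST * 1
    # The others wait for it, then run concurrently against the cache.
    waiting = identical_requests - 1
    cached = NETWORK_MS + APP_MS_PER_REQUEST * waiting + CACHE_MS
    return unique, unique + cached


if __name__ == "__main__":
    unique_ms, waiting_ms = smart_breaker_times_ms(17)
    print(unique_ms / 1000, "s for the unique request")
    print(waiting_ms / 1000, "s for each of the waiting requests")
```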

Conclusions

  • The smart circuit breaker reduces response times from a range of [34.1-38.3] seconds to a narrower range of [4.7-8.7] seconds, which is roughly 5 times better!
  • Because only one unique request is processed while the identical ones are kept waiting, the WorkManager has 19 free threads to process other, different requests. The capacity of the system therefore increases from (20 - 17 = 3) to (20 - 1 = 19) free threads, which is about six times more.
  • With the normal circuit breaker approach, subsequent requests to /ss/cfa/1230 arriving before second 34.1 (only 3 more, because 17 are already running and the WorkManager limit is 20) are processed uncached. These extra requests increase the total processing time because of the linear formula mentioned earlier.
  • With the smart circuit breaker, new requests arriving before second 4.7 are put on hold, so the time to finish the unique request remains constant. In addition, 19 threads are free, allowing the system to process more cached or uncached requests other than /ss/cfa/1230. Several of those new requests may ask for the same information, in which case the smart circuit breaker pattern applies to them as well.

Assumptions & comments

Normally, systems can process several requests in parallel with constant response time as long as there are enough processors to dedicate one to each request.

But when the number of requests far exceeds the number of processors, response time increases because the scheduler spreads the available computing capacity across all pending requests. This increase is proportional to

RT = (concurrent requests - number of processors) / number of processors + overhead

Overhead comes from pool schedulers, timers, context switches, memory management (paging, caches, …), and everything else involved in multi-process/multi-thread computation. If the overhead is low compared with the processing time and the number of requests is much higher than the number of CPUs, response time follows a linear formula:

RT = concurrent requests/number of processors

For instance, if we have 9 concurrent requests, 1 CPU, and the response time for one request is 4 s, the response time for each request is approximately

RT ~= (9/1) x 4 = 9 x 4 = 36s
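
The same estimate can be written as a small helper. This is only a sketch of the simplified linear model: it ignores overhead entirely, assumes constant response time while there are enough processors, and the function name is made up for the example.

```python
def estimated_response_time_s(concurrent_requests, processors, single_request_s):
    """Simplified linear model of response time, ignoring overhead."""
    if concurrent_requests <= processors:
        # Enough processors for every request: response time stays constant.
        return single_request_s
    # More requests than processors: time grows with the requests/processors ratio.
    return concurrent_requests / processors * single_request_s


if __name__ == "__main__":
    # The example above: 9 concurrent requests, 1 CPU, 4 s per request.
    print(estimated_response_time_s(9, 1, 4), "s")
```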

In general, when a system starts to saturate, overheads and internal processing become significant, so the system no longer works in the optimal constant/linear part of the response curve.

smartcircuitbraker3

Enjoy 😉
