Lift and Shift of a Long-Running Distributed Compute Workload | On-Prem vs Oracle Cloud Infrastructure


This is the real story of a company that runs a long-duration workload once at the end of every month: a very mature solvency risk calculation, implemented as a distributed computing application built with Oracle Coherence and Java. Performance degrades progressively over the course of the run, because garbage collection pauses get longer as the memory used by each Java process grows. Sometimes the customer has to kill the process and start again because it takes too long and does not seem to reach an end.

We performed a benchmark by lifting and shifting the app to Oracle Cloud Infrastructure and testing two different workloads. The table below shows the best results, but thanks to Terraform and the flexibility of the cloud we were able to execute more than 30 workloads with different topologies and compute shapes, on both Virtual Machine and Bare Metal. Let’s see what happened!

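| LOCATION | RESOURCES | CLUSTER NODES | TOTAL COHERENCE INSTANCES | TOTAL CORES | TOTAL MEMORY | WORKLOAD | DURATION |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ONPREM | VMWare | 12 | 96 | 384 | 3.072 TB | 1 | 11h |
| ORACLE CLOUD | VM.DenseIO2.24 | 10 | 100 | 240 | 3.2 TB | 1 | 6h25min |
| ONPREM | VMWare | 10 | 96 | 384 | 3.072 TB | 2 | 15h |
| ORACLE CLOUD | VM.DenseIO2.24 | 10 | 100 | 240 | 3.2 TB | 2 | 10h53min |
| ORACLE CLOUD | VM.DenseIO2.24 | 20 | 200 | 460 | 6.4 TB | 2 | 8h5min |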

Technical details

Cloud environment IaC procedure: Terraform

Time to provision and start the cluster in cloud: 10-15 minutes

Time to destroy cloud infra: 3-5 mins

Application software, configuration, operating system and data: Identical for each workload, both on-prem and in the cloud. No improvements or changes were made to the application when it was moved to the cloud. No tuning was done to the cloud operating systems, network and the like; all settings are the default values provided by Oracle Cloud.

Storage Cloud: All nodes read and write data from/to a Shared File System; software and logs are also on the shared disk, with no use of local disk at all

Storage OnPrem: Data on NAS, software on local disk

VMWARE environment: In operation for more than 6 years and supposed to be tuned as much as possible; previously lifted to AWS, then moved back on-prem because of its cost

Benchmark dates: October-November 2019

Datacenter: eu-frankfurt-1, AD3

IaC Workstation location: Madrid
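
The provision and destroy times listed above can be measured from the IaC workstation by timing the Terraform commands themselves. The snippet below is only a minimal sketch of that idea: the helper names `timed_run` and `measure_lifecycle` and the `workdir` argument are hypothetical, and it assumes the `terraform` CLI is on the PATH and the directory already holds a working configuration. It times infrastructure creation and destruction only, not the start of the Coherence cluster, and it is not necessarily how the figures in this post were taken.

```python
import subprocess
import sys
import time


def timed_run(cmd, cwd=None):
    """Run a command to completion and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(cmd, cwd=cwd, check=True)  # raises if the command fails
    return time.monotonic() - start


def measure_lifecycle(workdir, terraform="terraform"):
    """Time one create/destroy cycle of the Terraform configuration in workdir.

    Hypothetical helper: workdir is assumed to contain the .tf files.
    """
    timed_run([terraform, "init", "-input=false"], cwd=workdir)
    apply_seconds = timed_run([terraform, "apply", "-auto-approve"], cwd=workdir)
    destroy_seconds = timed_run([terraform, "destroy", "-auto-approve"], cwd=workdir)
    return apply_seconds, destroy_seconds


# Harmless demonstration of the timer with a command that exists everywhere
# Python does; replace it with measure_lifecycle("path/to/config") for real use.
elapsed = timed_run([sys.executable, "-c", "pass"])
print(f"no-op command took {elapsed:.2f} s")
```

Using a monotonic clock keeps the measurement immune to system clock adjustments during a 10-15 minute run.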

Observations

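Workload 1: With similar memory and about 38% fewer cores (240 vs 384), Oracle Cloud completes the workload in about 42% less time (6h25min vs 11h)

Workload 2: With similar memory and about 38% fewer cores (240 vs 384), Oracle Cloud completes the workload in about 27% less time (10h53min vs 15h)

Workload 2: Doubling the number of cloud nodes from 10 to 20 reduces the duration by about 26% (8h5min vs 10h53min)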
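
For anyone who wants to check the percentages, here is a small sketch that recomputes them from the cores and durations in the results table. The helper names `hours` and `reduction_pct` are just illustrative; the input numbers are the ones reported above.

```python
def hours(h, m=0):
    """Convert a duration given as hours and minutes to decimal hours."""
    return h + m / 60


def reduction_pct(baseline, new):
    """Percentage by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline * 100


# Durations taken from the results table.
onprem_w1 = hours(11)
cloud_w1 = hours(6, 25)
onprem_w2 = hours(15)
cloud_w2_10_nodes = hours(10, 53)
cloud_w2_20_nodes = hours(8, 5)

cores_cut = reduction_pct(384, 240)                        # 37.5
w1_cut = reduction_pct(onprem_w1, cloud_w1)                # about 41.7
w2_cut = reduction_pct(onprem_w2, cloud_w2_10_nodes)       # about 27.4
scale_cut = reduction_pct(cloud_w2_10_nodes, cloud_w2_20_nodes)  # about 25.7

# Speedup obtained by doubling the cloud nodes for workload 2.
speedup = cloud_w2_10_nodes / cloud_w2_20_nodes            # about 1.35

print(f"cores: -{cores_cut:.1f}%")
print(f"workload 1: -{w1_cut:.1f}% time")
print(f"workload 2: -{w2_cut:.1f}% time")
print(f"workload 2, 10 -> 20 nodes: -{scale_cut:.1f}% time, speedup {speedup:.2f}x")
```

The last figure shows that doubling the nodes gives roughly a 1.35x speedup for workload 2, so the workload scales out, but far from linearly.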

That’s all, hope it helps! 🙂
