OCI provides a rich set of services and tools for modernising a legacy Hadoop solution into a modern data lake. The approach here is broadly the same as what other cloud providers propose. Let's dive into the main steps.
STEP 1: Provision ephemeral Hadoop on cloud
OCI Big Data Service (BDS) is a Hadoop distribution running on OCI infrastructure. You can provision a cluster in a couple of hours.
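As a rough sketch, a cluster can be created with the OCI CLI. The flag names and values below are indicative assumptions (the real command also needs node-shape configuration, and some flags vary by CLI version), so treat this as a starting point, not a definitive invocation:

```shell
# Sketch only: run `oci bds instance create --help` for the full,
# version-accurate set of required options (node shapes, network config).
oci bds instance create \
  --compartment-id "$COMPARTMENT_OCID" \
  --display-name "ephemeral-bds" \
  --cluster-version ODH1 \
  --cluster-admin-password "$ADMIN_PASSWORD" \
  --cluster-public-key "$(cat ~/.ssh/id_rsa.pub)" \
  --is-high-availability false \
  --is-secure false
```

Since this cluster is ephemeral, keep it minimally sized; it only needs to hold the data in flight.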
STEP 2: Hadoop replication
We are going to replicate Hadoop data with the distcp Hadoop tool.
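A typical distcp run between the on-premises cluster and the ephemeral BDS cluster looks like this (hostnames and paths are placeholders):

```shell
# -update skips files that are already up to date on the destination,
# which makes repeated incremental runs cheap; -p preserves file attributes.
hadoop distcp -update -p \
  hdfs://onprem-namenode:8020/data/warehouse \
  hdfs://bds-namenode:8020/data/warehouse
```

Because `-update` is incremental, you can re-run the same command close to cutover to pick up only the data that changed since the last pass.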
STEP 3: Provision ephemeral Object Storage in OCI for migration purposes
This storage serves as a temporary destination for the data replicated to the ephemeral Hadoop cluster, on its way to the new data lake.
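Creating the staging bucket is a one-liner with the OCI CLI (the bucket name is illustrative):

```shell
# The namespace is inferred from the authenticated tenancy.
oci os bucket create \
  --compartment-id "$COMPARTMENT_OCID" \
  --name migration-staging
```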
STEP 4: Copy data from ephemeral Hadoop to ephemeral Object Storage
To copy data between the ephemeral Hadoop cluster and the ephemeral Object Storage bucket, we use the odcp tool provided by OCI Big Data Service.
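Conceptually, odcp takes a source and a destination URI, much like distcp. The `oci://bucket@namespace/` URI form below is an assumption; check the odcp documentation for your BDS version for the exact scheme:

```shell
# Copy a warehouse directory from HDFS on the BDS cluster into the
# staging bucket created in the previous step (names are illustrative).
odcp hdfs:///data/warehouse \
     oci://migration-staging@$OS_NAMESPACE/warehouse/
```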
STEP 5: Provision new data lake layers and tools in OCI
Provision OCI Object Storage buckets and choose a flavour for the SQL engine (ADW, ExaCS, MySQL).
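A minimal sketch of this step, assuming a bronze/silver/gold bucket layout (the layer names are an assumption, not a mandate) and ADW as the SQL engine flavour:

```shell
# Storage layers for the new data lake (bucket names are illustrative).
for layer in bronze silver gold; do
  oci os bucket create \
    --compartment-id "$COMPARTMENT_OCID" \
    --name "lake-$layer"
done

# One possible SQL engine: an Autonomous Data Warehouse.
oci db autonomous-database create \
  --compartment-id "$COMPARTMENT_OCID" \
  --db-name lakedw \
  --db-workload DW \
  --cpu-core-count 2 \
  --data-storage-size-in-tbs 1 \
  --admin-password "$ADMIN_PASSWORD"
```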
STEP 6: Migrate Spark processes to OCI and/or build new pipelines
OCI provides a comprehensive choice of tools for creating pipelines (batch and streaming): serverless, low-code visual tools that generate Spark code automatically, such as OCI Data Integration and GoldenGate Stream Analytics; the serverless Spark service OCI Data Flow; and server-full options such as OCI Kubernetes Engine (OKE).
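For the OCI Data Flow route, an existing Spark job can be registered as an application from the CLI. The flags below are my best-effort assumptions; verify them with `oci data-flow application create --help`:

```shell
# Register a PySpark job stored in Object Storage as a Data Flow
# application (names, shapes, and versions are illustrative).
oci data-flow application create \
  --compartment-id "$COMPARTMENT_OCID" \
  --display-name "etl-daily" \
  --language PYTHON \
  --spark-version 3.2.1 \
  --file-uri "oci://lake-code@$OS_NAMESPACE/jobs/etl_daily.py" \
  --driver-shape VM.Standard2.1 \
  --executor-shape VM.Standard2.1 \
  --num-executors 2
```

The appeal of Data Flow here is that there is no cluster to manage: each run spins up its own Spark runtime and tears it down afterwards.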
STEP 7: Temporary data coexistence
In addition to the tools mentioned above for replicating and copying data, OCI provides a comprehensive choice of tools for the coexistence of legacy Hadoop data and the new data lake. With these tools we can adjust the migration process if needed:
- OLH (Oracle Loader for Hadoop): a tool for loading data from Hadoop into Oracle databases
- cp2hadoop (Copy to Hadoop): a tool for copying data from Oracle databases to Hadoop
- OSCH (Oracle SQL Connector for HDFS): a collection of binaries and libraries that enables performant SQL queries from Oracle databases against data in Hadoop
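To give a feel for the first of these, Oracle Loader for Hadoop is submitted as a MapReduce job driven by an XML configuration (JDBC URL, target table, input paths). The paths and config file name below are illustrative:

```shell
# Launch OLH; olh_config.xml would define the Oracle connection,
# the target table, and the HDFS input to load.
hadoop jar $OLH_HOME/jlib/oraloader.jar \
  oracle.hadoop.loader.OraLoader \
  -conf olh_config.xml
```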
STEP 8: Duplicate ingestion to OCI and test new pipelines
STEP 9: Provision new path for consumers
STEP 10: Decommission legacy and ephemeral infrastructure
That’s all, hope it helps! 🙂