Migrating from Hadoop to a data lake (or lakehouse, if you will) in OCI


OCI provides a rich set of services and tools for modernising a legacy Hadoop solution into a modern data lake. The approach here is more or less the same as the one other cloud providers propose. Let's dive into the main steps.

Step 1: Provision ephemeral Hadoop in the cloud

OCI Big Data Service (BDS) is a managed Hadoop distribution running on OCI infrastructure. You can provision a cluster in a couple of hours.
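
Provisioning is usually done from the console, the CLI, or Terraform. As a minimal sketch, assuming the OCI Python SDK's bds client and a placeholder cluster OCID, you could poll the new cluster until it becomes ACTIVE:

    import time

    import oci

    config = oci.config.from_file()  # default ~/.oci/config profile
    bds = oci.bds.BdsClient(config)

    cluster_ocid = "ocid1.bdsinstance.oc1..example"  # placeholder OCID

    # Poll the cluster's lifecycle state until provisioning finishes.
    while True:
        cluster = bds.get_bds_instance(cluster_ocid).data
        print(f"{cluster.display_name}: {cluster.lifecycle_state}")
        if cluster.lifecycle_state == "ACTIVE":
            break
        time.sleep(60)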

Step 2: Hadoop replication

We replicate the HDFS data from the legacy cluster to the ephemeral BDS cluster with Hadoop's distcp tool.
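
A minimal sketch of driving that replication from Python: the NameNode hostnames, paths, and map count below are placeholders, and the script assumes a hadoop client configured to reach both clusters.

    import subprocess

    SRC = "hdfs://legacy-nn:8020/data/warehouse"  # placeholder source cluster
    DST = "hdfs://bds-nn:8020/data/warehouse"     # placeholder BDS cluster

    subprocess.run(
        [
            "hadoop", "distcp",
            "-update",   # only copy files that changed since the last run
            "-p",        # preserve ownership, permissions and timestamps
            "-m", "50",  # cap the number of map tasks
            SRC, DST,
        ],
        check=True,
    )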

Step 3: Provision ephemeral Object Storage in OCI for migration purposes

This storage is a temporary destination for the data replicated to the ephemeral Hadoop cluster, on its way into the new data lake.
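
A minimal sketch of creating the staging bucket with the OCI Python SDK; the bucket name and compartment OCID are placeholders:

    import oci

    config = oci.config.from_file()
    object_storage = oci.object_storage.ObjectStorageClient(config)

    namespace = object_storage.get_namespace().data
    details = oci.object_storage.models.CreateBucketDetails(
        name="hadoop-migration-staging",                  # placeholder name
        compartment_id="ocid1.compartment.oc1..example",  # placeholder OCID
        storage_tier="Standard",
    )
    bucket = object_storage.create_bucket(namespace, details).data
    print(f"Created bucket {bucket.name} in namespace {namespace}")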

Step 4: Copy data from ephemeral Hadoop to ephemeral Object Storage

For the copy between the ephemeral Hadoop cluster and the ephemeral Object Storage we use the odcp tool provided with OCI Big Data Service.
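
A minimal sketch, assuming odcp is available on a BDS node and that the `odcp copy <source> <destination>` form applies to your cluster version (the exact flags vary, so check the BDS documentation); bucket, namespace, and paths are placeholders:

    import subprocess

    SRC = "hdfs:///data/warehouse"                                # placeholder
    DST = "oci://hadoop-migration-staging@mynamespace/warehouse"  # placeholder

    # odcp runs the copy as a distributed job on the BDS cluster.
    subprocess.run(["odcp", "copy", SRC, DST], check=True)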

Step 5: Provision the new data lake layers and tools in OCI

Provision the OCI Object Storage buckets and choose a flavour for the SQL engine: Autonomous Data Warehouse (ADW), Exadata Cloud Service (ExaCS), or MySQL.
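
As a quick smoke test of the SQL engine layer, here is a hedged sketch that connects to an assumed ADW instance with the python-oracledb driver; the credentials, TNS alias, and wallet directory are placeholders:

    import oracledb

    # For ADW with mutual TLS you would also pass wallet parameters
    # (wallet_location, wallet_password); omitted here for brevity.
    connection = oracledb.connect(
        user="admin",
        password="placeholder-password",  # placeholder credentials
        dsn="myadw_high",                 # placeholder TNS alias
        config_dir="/path/to/wallet",     # placeholder wallet directory
    )
    with connection.cursor() as cursor:
        cursor.execute("SELECT sysdate FROM dual")
        print(cursor.fetchone())
    connection.close()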

Step 6: Migrate Spark processes to OCI and/or build new pipelines

OCI provides a comprehensive choice of tools for creating batch and streaming pipelines: serverless, low-code visual tools that generate Spark code automatically, such as OCI Data Integration and GoldenGate Stream Analytics; a serverless Spark service, OCI Data Flow; and server-full Spark options such as OCI Kubernetes Engine (OKE).
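
For illustration, a minimal PySpark batch job that could run as an OCI Data Flow application, reading raw CSV from the staging bucket and writing Parquet to an assumed lake bucket (bucket names and namespace are placeholders; Data Flow resolves oci:// URIs natively):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("staging-to-lake").getOrCreate()

    # Read the raw files landed during the migration.
    raw = (
        spark.read
        .option("header", "true")
        .csv("oci://hadoop-migration-staging@mynamespace/warehouse/")
    )

    # Write them into the new lake layer in a columnar format.
    (
        raw.write
        .mode("overwrite")
        .parquet("oci://lake-bronze@mynamespace/warehouse/")
    )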

Step 7: Temporary data coexistence

In addition to the tools mentioned above for replicating and copying data, OCI provides a comprehensive choice of tools for the coexistence of the legacy Hadoop data and the new data lake.

With these tools we can "adjust" the migration process if needed (a hedged sketch of an OLH job submission follows the list):

  • OLH (Oracle Loader for Hadoop): a tool for loading data from Hadoop into Oracle databases
  • CP2HADOOP (Copy to Hadoop): a tool for copying data from Oracle databases into Hadoop
  • OSCH (Oracle SQL Connector for HDFS): a collection of binaries and libraries that enables performant SQL queries from Oracle databases over data in Hadoop
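
As an illustration of the first of these, a heavily hedged sketch of submitting an OLH job; the jar path and job configuration are placeholders, and the invocation pattern follows the Oracle Big Data Connectors documentation:

    import subprocess

    subprocess.run(
        [
            "hadoop", "jar",
            "/opt/oracle/olh/jlib/oraloader.jar",  # placeholder OLH install path
            "oracle.hadoop.loader.OraLoader",      # OLH driver class
            "-conf", "olh_job_conf.xml",           # placeholder job configuration
        ],
        check=True,
    )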

Step 8: Duplicate ingestion to OCI and test the new pipelines

Run ingestion into both the legacy cluster and the new data lake in parallel, and validate that the new pipelines produce the same results as the legacy ones.
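
A minimal reconciliation sketch, comparing record counts for the same logical dataset across the two stacks (paths are placeholders; real validation would also compare checksums or sampled rows):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dual-run-check").getOrCreate()

    # Same logical dataset produced by the legacy and the new pipeline.
    legacy = spark.read.parquet("hdfs:///data/warehouse/orders")
    lake = spark.read.parquet("oci://lake-bronze@mynamespace/warehouse/orders")

    legacy_count, lake_count = legacy.count(), lake.count()
    print(f"legacy={legacy_count} lake={lake_count}")
    assert legacy_count == lake_count, "row counts diverge between pipelines"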

Step 9: Provision the new path for consumers

Point consumers (BI tools, notebooks, applications) at the new data lake endpoints instead of the legacy cluster.
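
One hedged option for read-only consumers is an Object Storage pre-authenticated request (PAR); the names below are placeholders and the access-type value should be checked against your SDK version:

    from datetime import datetime, timedelta, timezone

    import oci

    config = oci.config.from_file()
    object_storage = oci.object_storage.ObjectStorageClient(config)
    namespace = object_storage.get_namespace().data

    details = oci.object_storage.models.CreatePreauthenticatedRequestDetails(
        name="lake-read-consumers",    # placeholder PAR name
        access_type="AnyObjectRead",   # bucket-wide read-only access
        time_expires=datetime.now(timezone.utc) + timedelta(days=30),
    )
    par = object_storage.create_preauthenticated_request(
        namespace, "lake-gold", details  # placeholder bucket name
    ).data
    print(f"PAR path: {par.access_uri}")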

Step 10: Decommission legacy and ephemeral infrastructure

Once consumers are fully switched over, tear down the legacy cluster, the ephemeral BDS cluster, and the ephemeral migration buckets.
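
A minimal teardown sketch, assuming the same OCI Python SDK clients as above; OCIDs and bucket names are placeholders, and buckets must be emptied before deletion:

    import oci

    config = oci.config.from_file()

    # Terminate the ephemeral BDS cluster.
    bds = oci.bds.BdsClient(config)
    bds.delete_bds_instance("ocid1.bdsinstance.oc1..example")  # placeholder

    # Drop the (already emptied) ephemeral staging bucket.
    object_storage = oci.object_storage.ObjectStorageClient(config)
    namespace = object_storage.get_namespace().data
    object_storage.delete_bucket(namespace, "hadoop-migration-staging")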

That’s all, hope it helps! 🙂

Related information:

https://docs.oracle.com/en/solutions/arch-center-about-data-lake/recommended-patterns-cloud-based-data-lakes1.html#GUID-5671E804-898E-400B-8A69-DD54E9001062

https://docs.oracle.com/en/solutions/learn-deploy-hadoop-oci/index.html

https://docs.oracle.com/en/solutions/deploy-goldengate-stream-analytics/index.html#GUID-E2F01B24-EC4E-4A5F-A0BF-B201202D6FD2

https://docs.oracle.com/en/solutions/best-practices-cloudera-on-oci/index.html#GUID-1EC73133-4D4A-4CE4-BE56-135EC7C6E7EE

https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-data

https://databricks.com/blog/2021/08/06/5-key-steps-to-successfully-migrate-from-hadoop-to-the-lakehouse-architecture.html
