Building a modern datalake in OCI with 8 serverless* cloud services


This post was written in the summer of 2021 in Banyalbufar, on the island of Mallorca

(*) Plus one coming from the marketplace and several embedded OCI core capabilities

Disclaimer: Depending on the size of your workloads, this may not be the best solution for you; in any case, there are certainly other OCI capabilities that can fit your particular use case.

The diagram below shows the logical architecture of the modern data lake that we intend to explain. In the next few paragraphs we’ll dive into each component a bit.

THE INGESTION

Data ingestion may be real-time or batch. For ingesting raw data before it is processed in the factory, we provide services that create incoming ports with serial (Kafka) and random-access (Object Storage) data methods. The random-access storage can be complemented with a database if you wish.

In addition, we provide a service that persists the real-time streams, for data consistency, temporary backup, and avoidance of side effects.

0 Object Storage is an Amazon S3-compatible object store at the core of OCI; this is the repository for data ingested in batch mode

1 Streaming is a serverless, Kafka-compatible managed service; this is the repository for storing data received in real time
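Because the service is Kafka-compatible, any standard Kafka client can connect with connection properties along these lines; the region endpoint, tenancy, user, stream pool OCID, and auth token below are placeholders you would replace with your own values:

```properties
# OCI Streaming, Kafka compatibility mode (placeholder values)
bootstrap.servers=cell-1.streaming.us-ashburn-1.oci.oraclecloud.com:9092
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="<tenancy-name>/<user-name>/<stream-pool-ocid>" \
  password="<auth-token>";
```

With these properties in place, existing Kafka producers and consumers work unchanged against the stream.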

2 Service Connector Hub is a serverless, zero-code service for replicating data between sources and targets. If you need some transformation logic, you can add it using Functions. In this use case we use the connector to persist the data received on the Kafka stream into Object Storage, for data consistency purposes
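As a sketch of the kind of transformation logic one might attach via Functions, here is a plain-Python handler that decodes a stream message and keeps a few fields before it lands in Object Storage. The `device` and `reading` field names are illustrative assumptions, not part of any connector contract:

```python
import base64
import json

def transform(record: dict) -> dict:
    """Decode a base64-encoded stream message and keep selected fields.

    Stream messages carry a base64-encoded 'value'; the 'device' and
    'reading' fields below are hypothetical payload fields.
    """
    payload = json.loads(base64.b64decode(record["value"]))
    return {
        "device": payload.get("device"),
        "reading": payload.get("reading"),
        "offset": record.get("offset"),
    }

# Example: a message as it might arrive from the stream
raw = {
    "value": base64.b64encode(
        json.dumps({"device": "sensor-1", "reading": 21.5}).encode()
    ).decode(),
    "offset": 42,
}
print(transform(raw))  # {'device': 'sensor-1', 'reading': 21.5, 'offset': 42}
```

In a real deployment this logic would be wrapped in a Functions handler and referenced from the connector's task configuration.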

THE PROCESSING PIPELINES

The machine tools of our data factory. We provide a comprehensive set of tools, from serverless low-code tools to fully coded Spark managed clusters, in which you choose only the compute capacity needed for the workload and the timeframe to process the data. Please note that some tools target real-time pipelines, others target batch processing, and some can be used for both.

3 Data Integration, low-code data pipelines. This service allows the creation of pipeline processes without a single line of code, just by using a graphical user interface. The pipelines are deployed on a serverless platform for execution; the underlying technology is Spark

4 GoldenGate Stream Analytics is not serverless, but it can be provisioned from the marketplace. It is a low-code tool for streaming pipelines, used to create custom operational dashboards that provide real-time monitoring and analysis of event streams on an Apache Spark-based system

5 Data Flow, a managed service for Spark processing pipelines. You build your Spark pipelines in the CI/CD environment and, once tested, they are deployed for execution on the platform; the underlying technology is Spark

THE LAKE (“La pileta”)

Data will be stored in object storage and in powerful databases offering in-memory, columnar, or row data organisation; encryption in transit and at rest; data compression; self-tuning; auto-scaling; zero-downtime elastic compute resource management; and much more.

0 Object Storage, already mentioned above; in this case it stores processed data according to the data domains designed for the lake
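A common convention for laying out the processed zone is to encode the data domain and a date partition into the object name itself. A minimal sketch of such a naming helper (the `sales`/`orders` domain names are made up for illustration):

```python
from datetime import date

def object_key(domain: str, dataset: str, day: date, filename: str) -> str:
    """Build a partitioned object name like sales/orders/2021/07/15/part-0.parquet."""
    return f"{domain}/{dataset}/{day:%Y/%m/%d}/{filename}"

print(object_key("sales", "orders", date(2021, 7, 15), "part-0.parquet"))
# sales/orders/2021/07/15/part-0.parquet
```

Date-partitioned prefixes like this keep listing operations cheap and let batch engines prune irrelevant partitions.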

6 A comprehensive choice of Oracle databases, from virtual machine deployments to Autonomous Database or Exadata, providing SQL and non-SQL capabilities

APPLYING INTELLIGENCE BACK ON DATA

Machine learning capabilities are provided by two kinds of cloud services: embedded in the database, or standalone

6 In-database AutoML, machine learning features embedded in the database core

7 Data Science, managed serverless platform for training and managing machine learning models
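To make this concrete, here is a tiny, self-contained sketch of the sort of training loop you might run in a notebook session; a real project on the platform would use its conda environments and libraries such as scikit-learn, which are assumed here rather than shown:

```python
# Fit y = w*x + b by gradient descent on a toy dataset (pure stdlib).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]  # roughly y = 2x + 1

w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    # Gradients of mean squared error with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.2f}, b={b:.2f}")  # close to w=2, b=1
```

The platform's value is less in the training loop itself than in managing the environments, compute shapes, and model catalog around code like this.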

GOVERNANCE, xOPS & OBSERVABILITY

A bunch of capabilities for data governance, MLOps, DevOps, CI/CD, IaC, monitoring, logging, notifications, alarms, and more.

0 Resource Manager, DevOps, and Visual Builder Studio are a comprehensive set of tools for the lifecycle of application or IaC code, embedded in the core of OCI
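For example, Resource Manager runs standard Terraform, so a bucket for the lake can be declared as IaC along these lines; the variable names are placeholders:

```hcl
data "oci_objectstorage_namespace" "ns" {
  compartment_id = var.compartment_ocid
}

resource "oci_objectstorage_bucket" "lake_raw" {
  compartment_id = var.compartment_ocid
  namespace      = data.oci_objectstorage_namespace.ns.namespace
  name           = "lake-raw"
  versioning     = "Enabled"
}
```

Packaging definitions like this into a Resource Manager stack gives you plan/apply runs, state management, and drift detection without operating Terraform yourself.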

0 Monitoring, Audit, Notifications, Events, Logging, and Health Checks are a comprehensive set of observability tools and services embedded in the core of OCI

6 APEX, low-code app development embedded in the database

8 Data Catalog, managed serverless data governance

CONSUMPTION AND DATA PRODUCTS

Whether you provide data internally in your organisation or to third parties, and whether you monetise it or not, data products are the outcomes of the data lake and the data factory built on top of it. The following services provide powerful capabilities to create outgoing ports for consumers, either humans or machines.

0 Object Storage, already mentioned above; this repository provides output ports for data to be consumed in the downstream chain in random-access mode

1 Streaming, already mentioned above, provides output ports for data to be consumed in the downstream chain in serial mode

8 Analytics, AI-powered, self-service analytics capabilities for data preparation, visualization, enterprise reporting, augmented analysis, and natural language processing

9 API Gateway is a managed, serverless API publishing and management service that exposes REST services for data consumption by end users and applications in the downstream chain
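A route on the gateway is described in a deployment specification; a minimal example might look like the following, where the path and backend URL are placeholders for your own data service:

```json
{
  "routes": [
    {
      "path": "/orders",
      "methods": ["GET"],
      "backend": {
        "type": "HTTP_BACKEND",
        "url": "https://example-backend.internal/orders"
      }
    }
  ]
}
```

On top of specifications like this, the gateway layers authentication, rate limiting, and CORS policies, so data consumers get one governed front door to the lake.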

There are other OCI services that fit into the data lake domain, but we'll stop here so as not to bore you further.

That’s all, hope it helps!! 🙂
