This post was written in the summer of 2021 in Banyalbufar, on the island of Mallorca
(*) Plus one coming from the marketplace and several embedded OCI core capabilities
Disclaimer: Depending on the size of your workloads this may not be the best solution for you; in any case, there are surely other OCI capabilities that fit your particular use case
The diagram below shows the logical architecture of the modern data lake that we intend to explain. In the next few paragraphs we’ll dive into each component a bit.
THE INGESTION
Data ingestion may be realtime or batch. For the ingestion of raw data prior to being processed in the factory, we provide services with capabilities to create incoming ports with serial (Kafka) and random data access (Object Storage) methods. The random access storage can be enriched with a database if needed.
In addition, we provide a service that serialises the realtime streams for data consistency, temporary backup purposes and side-effect eviction.
0 Object Storage, an Amazon S3 compatible object storage at the core of OCI; this is the repository for data ingested in batch mode
1 Streaming, a serverless Kafka compatible managed service; this is the repository for storing data received in realtime
2 Service Connector, a serverless zero-code service for replicating data between sources and targets. If you need some transformation logic, you can add it by using Functions. In this use case we use the connector to store the data received in the Kafka stream into Object Storage for data consistency purposes
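As a sketch of the kind of transformation logic such a Function could apply before the connector lands stream records in Object Storage (the handler name and the record shape are hypothetical, not the actual Functions API):

```python
import json

def handler(records):
    """Hypothetical transform step: normalise raw Kafka records
    before they are written to Object Storage."""
    out = []
    for rec in records:
        event = json.loads(rec)       # each record arrives as a JSON string
        out.append({
            "key": event.get("id"),   # promote the event id as the object key
            "payload": event,         # keep the original event intact
        })
    return out

# Example: two raw stream records
raw = ['{"id": "a1", "temp": 21}', '{"id": "a2", "temp": 19}']
normalised = handler(raw)
# normalised[0] == {"key": "a1", "payload": {"id": "a1", "temp": 21}}
```

The point is simply that the connector hands the Function a batch of records and stores whatever the Function returns.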
THE PROCESSING PIPELINES
The machine tools of our data factory. We provide a comprehensive set of tools, from serverless low-code tools to full-code managed Spark servers in which you only choose the compute capacity necessary depending on the workload and the timeframe to process the data. Please note we provide some tools for realtime pipelines, some others for batch processing, and some that can be used for both.
3 Data Integration, low-code data pipelines. This service allows the creation of pipeline processes without a single line of code, just by using a graphical user interface. The pipelines are deployed on a serverless platform for execution; the underlying technology is Spark
4 GG Stream Analytics, this is not serverless, but it can be provisioned from the marketplace; it is a low-code streaming pipeline tool for the creation of custom operational dashboards that provide real-time monitoring and analysis of event streams in an Apache Spark based system
5 Dataflow, a managed service for Spark processing pipelines. You build your Spark pipelines in the CI/CD environment and, once tested, they are deployed for execution on the platform; the underlying technology is Spark
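To make the batch-pipeline idea concrete, here is a minimal extract-transform-load sketch in plain Python; a real Dataflow job would express the same stages with the Spark APIs, and all names and data below are purely illustrative:

```python
def extract(rows):
    # parse CSV-like raw lines arriving from the ingestion port
    return [line.split(",") for line in rows]

def transform(parsed):
    # keep valid readings and convert the measure to a number
    return [(sensor, float(value)) for sensor, value in parsed if value]

def load(records):
    # aggregate per sensor, ready to be written to the lake
    totals = {}
    for sensor, value in records:
        totals[sensor] = totals.get(sensor, 0.0) + value
    return totals

raw = ["s1,10.5", "s2,3.0", "s1,4.5"]
result = load(transform(extract(raw)))
# result == {"s1": 15.0, "s2": 3.0}
```

Whether you draw these stages in Data Integration or code them as a Spark job for Dataflow, the pipeline shape is the same: read from an incoming port, transform, and write to the lake.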
THE LAKE (“La pileta”)
Data will be stored in object storage and in powerful databases offering in-memory, columnar or row data organisation, in-transit and at-rest data encryption, data compression, self-tuning, auto-scaling, elastic compute resource management with zero loss of service, and much more.
APPLYING INTELLIGENCE BACK ON DATA
Machine learning capabilities are provided by two kinds of cloud services: either embedded in the database or standalone
GOVERNANCE, xxxOPS & OBSERVABILITY
A bunch of capabilities for data governance, machine learning ops, devops, ci/cd, IaC, monitoring, logging, notifications, alarms and more.
6 APEX, low-code app development embedded in the database
8 Data Catalog, managed serverless data governance
CONSUMPTION AND DATA PRODUCTS
Whether you provide data internally within your organisation or to third parties, and whether you monetise them or not, data products are the outcomes of the data lake and the data factory built on top of it. The following services provide powerful capabilities to create outgoing ports for consumers, either humans or machines.
0 Object Storage, already mentioned above; this repository provides output ports for data to be consumed in the downstream chain in random mode
1 Streaming, already mentioned above; it provides output ports for data to be consumed in the downstream chain in serial mode
8 Analytics, AI-powered, self-service analytics capabilities for data preparation, visualization, enterprise reporting, augmented analysis, and natural language processing
9 API Gateway, a managed serverless API publishing and management system which exposes REST services for data consumption by end users and applications in the downstream chain
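The difference between the serial and random output ports mentioned above can be sketched with two tiny readers (in-memory stand-ins to illustrate the access patterns, not the actual service APIs):

```python
# serial port (Streaming-style): events are consumed in order, from an offset
stream = ["e1", "e2", "e3"]

# random port (Object Storage-style): objects are fetched directly by name
objects = {"2021/07/day1.json": b"{}", "2021/07/day2.json": b"{}"}

def read_serial(events, offset):
    """Consume every event from a given offset onwards."""
    return events[offset:]

def read_random(store, key):
    """Fetch exactly one object by its key."""
    return store[key]

tail = read_serial(stream, 1)            # ["e2", "e3"]
obj = read_random(objects, "2021/07/day1.json")
```

A downstream consumer picks the port that matches its access pattern: replay everything since an offset, or grab one object on demand.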
There are other OCI services that fit into the data lake domain, but we'll stop here for now so as not to bore you.
That’s all, hope it helps!! 🙂