Streaming from Data Layer¶
Overview¶
This stream process moves data from the data layer into the lakehouse.
- Change Data Capture (CDC) is responsible for transferring data from OLTP to OLAP.
- In the past, we implemented CDC with Datastream, but it had some problems for our company:
    - The cost was too high:
        - We do not need streaming data; we only need batch data.
    - Because we had to use a GCP template:
        - We had to design SCD2 on the OLTP side.
        - We could not customize the transfer flow.
- So we needed another CDC tool that is cheaper and supports batch data. Airbyte is a suitable choice because:
    - The cost is much lower:
        - It provides a strong scheduling feature for batch data.
    - We do not need to design SCD2 on the OLTP side.
Source: MySQL
Target: BigQuery
SAD - System Architecture Design¶
Logical View¶
The project follows this flow:
```mermaid
flowchart TB
    subgraph cdc_tool[CDC Tool]
        direction TB
        cdc_tool_log[Log]
        cdc_tool_ui[UI]
    end
    subgraph oltp[OLTP]
        oltp_data[Data]
        oltp_wal[WAL]
    end
    subgraph olap[OLAP]
        olap_data[Data]
    end
    subgraph transfer_flow[Transfer Flow]
        direction LR
        oltp <-- read WAL --> cdc_tool
        cdc_tool -- write --> olap
    end
    schedule[Schedule] -- trigger --> transfer_flow
```
Physical View¶
The previous flow used Datastream:

```mermaid
flowchart LR
    rdbms[RDBMS] --> datastream[Datastream] --> bigquery[BigQuery]
```
The current flow uses Airbyte (which embeds Debezium) and is triggered by Prefect; a sketch of the trigger step follows the diagram:

```mermaid
flowchart TB
    subgraph airbyte[Airbyte]
        airbyte_debezium[Debezium]
    end
    cloud_sql[Cloud SQL]
    big_query[BigQuery]
    subgraph transfer_flow[Transfer Flow]
        direction LR
        cloud_sql <-- read WAL --> airbyte
        airbyte -- write --> big_query
    end
    prefect[Prefect] -- trigger --> transfer_flow
```
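The following is a minimal sketch of that trigger, not the project's actual flow code: a Prefect flow calls the `POST /api/v1/connections/sync` endpoint of the Airbyte Open Source API. The Airbyte URL and connection ID are placeholders.

```python
import os
import requests
from prefect import flow, task

# Assumed values: replace with the real Airbyte host and connection ID.
AIRBYTE_URL = os.getenv("AIRBYTE_URL", "http://localhost:8000")
CONNECTION_ID = os.getenv("AIRBYTE_CONNECTION_ID", "<connection-uuid>")


@task(retries=2, retry_delay_seconds=60)
def trigger_airbyte_sync(connection_id: str) -> dict:
    """Ask Airbyte to run one batch sync (Cloud SQL -> BigQuery) for the connection."""
    response = requests.post(
        f"{AIRBYTE_URL}/api/v1/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # metadata of the created sync job


@flow(name="cdc-batch-transfer")
def cdc_batch_transfer():
    """Plays the 'Schedule -- trigger --> Transfer Flow' role from the diagrams."""
    job = trigger_airbyte_sync(CONNECTION_ID)
    print(f"Started Airbyte job: {job.get('job', {}).get('id')}")


if __name__ == "__main__":
    cdc_batch_transfer()
```

The flow can then be scheduled through a Prefect deployment (for example with a cron schedule), which keeps the transfer strictly batch-based.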
Entities¶
Stream ID: dawnbreaker2magnus
Deployment¶
Deployment is triggered by a push to a specific branch; the flow and a sketch of the deploy step are shown below.
```mermaid
flowchart LR
    gitHub[GitHub] -- on push --> trigger[GitHub Actions] -- deploy --> cloud_compute[Cloud compute]
```
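A hedged sketch of the deploy step that the GitHub Actions job could run on the Compute Engine instance; the repository path, branch name, and the use of docker-compose are assumptions for illustration, not the project's actual workflow.

```python
import subprocess

# Hypothetical path and branch; the real workflow may differ.
REPO_DIR = "/opt/cdc-pipeline"
BRANCH = "main"


def run(cmd: list[str]) -> None:
    """Run a command inside the repository and fail loudly on error."""
    subprocess.run(cmd, cwd=REPO_DIR, check=True)


def deploy() -> None:
    # Pull the branch that triggered the workflow, then restart the services.
    run(["git", "fetch", "origin", BRANCH])
    run(["git", "checkout", BRANCH])
    run(["git", "pull", "origin", BRANCH])
    run(["docker-compose", "pull"])
    run(["docker-compose", "up", "-d", "--remove-orphans"])


if __name__ == "__main__":
    deploy()
```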
Set up - Service Account¶
The service account is required to access:
[1] Cloud SQL: Read WAL
[2] Cloud GCS: Upload WAL
[3] BigQuery: Upload log and data
To run the tests against your bucket, make sure to set up a Service Account with all necessary permissions (a quick permission check is sketched after this list):
- compute.instanceAdmin.v1
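A minimal smoke test for the service account's access, assuming the `google-cloud-storage` and `google-cloud-bigquery` client libraries with `GOOGLE_APPLICATION_CREDENTIALS` pointing at the key file; the bucket name is a placeholder. Cloud SQL access is exercised by Airbyte itself and by the binlog check further down.

```python
from google.cloud import bigquery, storage

# Placeholder bucket name; replace with the bucket used for the WAL uploads.
BUCKET_NAME = "your-wal-bucket"


def check_gcs() -> None:
    """Confirm the service account can reach the GCS bucket."""
    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)
    # Listing a single blob is enough to prove access without side effects.
    next(iter(bucket.list_blobs(max_results=1)), None)
    print(f"GCS OK: {BUCKET_NAME}")


def check_bigquery() -> None:
    """Confirm the service account can run queries against BigQuery."""
    client = bigquery.Client()
    result = client.query("SELECT 1 AS ok").result()
    print(f"BigQuery OK: {[row.ok for row in result]}")


if __name__ == "__main__":
    check_gcs()
    check_bigquery()
```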
Set up - Compute Engine¶
- Zone: asia-southeast1-b
- Machine type: e2-medium
- Image: Debian GNU/Linux 11 (bullseye)
- RAM: 4 GB
- vCPU: 2
- Disk: 30 GB
Set up - System Dependencies¶
| Service | Description | Note |
|---|---|---|
| docker | Docker daemon | |
| docker-compose | Composes multiple services defined in a docker-compose file | |
| htop | A cross-platform interactive process viewer | |
Set up - Cloud SQL for MySQL¶
Configure the Cloud SQL for MySQL database:
1. Enable binary logging. To enable binary logging for Cloud SQL for MySQL, see Enabling point-in-time recovery.
2. Create a replication user (named `datastream` below; Airbyte's Debezium connector needs the same replication privileges). Enter the following MySQL commands:

```sql
CREATE USER 'datastream'@'%' IDENTIFIED BY '[YOUR_PASSWORD]';
GRANT REPLICATION SLAVE, SELECT, REPLICATION CLIENT ON *.* TO 'datastream'@'%';
FLUSH PRIVILEGES;
```
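To confirm that binary logging is actually on and the replication user works, a quick check like the following can be run. It assumes the `mysql-connector-python` package and connection parameters supplied via environment variables; these are illustration-only.

```python
import os
import mysql.connector  # pip install mysql-connector-python

# Connection parameters are assumptions; point them at the Cloud SQL instance.
conn = mysql.connector.connect(
    host=os.environ["MYSQL_HOST"],
    user="datastream",
    password=os.environ["MYSQL_PASSWORD"],
)
cursor = conn.cursor()

# log_bin must be ON and binlog_format must be ROW for CDC to read changes.
cursor.execute("SHOW VARIABLES WHERE Variable_name IN ('log_bin', 'binlog_format')")
for name, value in cursor.fetchall():
    print(f"{name} = {value}")

# The grants issued above should include REPLICATION SLAVE and REPLICATION CLIENT.
cursor.execute("SHOW GRANTS FOR CURRENT_USER()")
for (grant,) in cursor.fetchall():
    print(grant)

cursor.close()
conn.close()
```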
The ideas behind this container pipeline are:
(a) Separate READ/WRITE load by replicating the source RDBMS into a columnar format (BigQuery).
(b) Apply SCD Type 2 on the target: history is preserved by adding a new row for every change (see the sketch below).
See the documentation at: https://www.sqlshack.com/implementing-slowly-changing-dimensions-scds-in-data-warehouses/
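For illustration, SCD Type 2 on the BigQuery side can be applied with an "expire, then insert" pair of statements run through the Python client. The dataset, table, and column names below are hypothetical; the real pipeline's schema will differ.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical tables: `stg.customers` holds the latest batch from Airbyte,
# `dwh.dim_customers` is the SCD2 dimension with valid_from/valid_to/is_current.
EXPIRE_CHANGED_ROWS = """
UPDATE `dwh.dim_customers` AS dim
SET dim.is_current = FALSE,
    dim.valid_to = CURRENT_TIMESTAMP()
WHERE dim.is_current = TRUE
  AND EXISTS (
    SELECT 1
    FROM `stg.customers` AS stg
    WHERE stg.customer_id = dim.customer_id
      AND stg.email != dim.email          -- any tracked attribute changed
  )
"""

INSERT_NEW_VERSIONS = """
INSERT INTO `dwh.dim_customers` (customer_id, email, valid_from, valid_to, is_current)
SELECT stg.customer_id, stg.email, CURRENT_TIMESTAMP(), NULL, TRUE
FROM `stg.customers` AS stg
LEFT JOIN `dwh.dim_customers` AS dim
  ON dim.customer_id = stg.customer_id AND dim.is_current = TRUE
WHERE dim.customer_id IS NULL             -- brand-new key, or its current row was just expired
"""

for statement in (EXPIRE_CHANGED_ROWS, INSERT_NEW_VERSIONS):
    client.query(statement).result()  # run sequentially; history rows are never overwritten
```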
Source Reference¶
https://cloud.google.com/datastream/docs/implementing-datastream-dataflow-analytics
https://cloud.google.com/datastream/docs/configure-your-source-mysql-database
https://cloud.google.com/datastream/docs/sources-mysql#mysqlknownlimitations
The full documentation for the pipeline is available at Pipeline in Confluent.