
Streaming from Data Layer


Overview

This stream process moves data from the data layer into the lakehouse.

  • Change Data Capture (CDC) is responsible for transferring data from OLTP to OLAP.

  • In the past, we implemented CDC using Datastream, but it had some problems for our company:

    • The cost is too high:

      • We don't need streaming data; we only need batch data.

    • Because we have to use a template on GCP:

      • We have to design SCD2 on OLTP.

      • We can't customize the transfer flow.

  • So we need another CDC tool that is less expensive and supports batch data. Airbyte is a suitable choice for several reasons:

    • The cost is much lower.

    • It has a strong scheduling feature for batching data.

    • We don't need to design SCD2 on OLTP.

Source: MySQL

Target: BigQuery

SAD - System Architecture Design

Logical View

The project follows this flow:

flowchart TB

  subgraph cdc_tool[CDC Tool]
    direction TB
    cdc_tool_log[Log]
    cdc_tool_ui[UI]
  end

  subgraph oltp[OLTP]
    oltp_data[Data]
    oltp_wal[WAL]
  end

  subgraph olap[OLAP]
    olap_data[Data]
  end

  subgraph transfer_flow[Transfer Flow]
    direction LR
    oltp <-- read WAL --> cdc_tool
    cdc_tool -- write --> olap
  end

  schedule[Schedule] -- trigger --> transfer_flow
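
To make the logical view concrete, the following Python sketch describes what one scheduled transfer run does. Every name in it (ChangeEvent, read_wal_batch, write_to_olap, transfer_flow) is a hypothetical placeholder for work the CDC tool performs internally; it is not part of the actual codebase.

# Minimal sketch of one scheduled transfer run in the logical view.
# All names are hypothetical placeholders; the real work is done by the CDC tool.

from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class ChangeEvent:
    table: str       # source table the change belongs to
    operation: str   # "insert", "update" or "delete"
    payload: dict    # row image read from the WAL

def read_wal_batch(since_position: int) -> List[ChangeEvent]:
    """Read the next batch of change events from the OLTP WAL (placeholder)."""
    raise NotImplementedError

def write_to_olap(events: Iterable[ChangeEvent]) -> None:
    """Append the change events to the OLAP store (placeholder)."""
    raise NotImplementedError

def transfer_flow(wal_position: int) -> int:
    """One transfer run: read a WAL batch and write it to OLAP.

    The schedule component triggers this function; it returns the new
    WAL position so the next run can resume where this one stopped.
    """
    events = read_wal_batch(wal_position)
    write_to_olap(events)
    return wal_position + len(events)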

Physical View

Previous flow (Datastream):

flowchart LR
  rdbms[RDBMS] --> datastream[Datastream] --> bigquery[BigQuery]

Current flow (Airbyte):

flowchart TB


  subgraph airbyte[Airbyte]
    airbyte_debezium[Debezium]
  end

  cloud_sql[Cloud SQL]
  big_query[BigQuery]

  subgraph transfer_flow[Transfer Flow]
    direction LR
    cloud_sql <-- read WAL --> airbyte
    airbyte -- write --> big_query
  end


  prefect[Prefect] -- trigger --> transfer_flow
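
A minimal sketch of how Prefect could trigger the Airbyte transfer flow is shown below. The Airbyte API address and the connection ID are illustrative assumptions, not the production values, and the flow is run once here instead of on a schedule.

import requests
from prefect import flow, task

# Assumptions: Airbyte's API is reachable at this address from the Prefect
# worker, and the connection ID is a placeholder for the Cloud SQL -> BigQuery
# connection configured in Airbyte.
AIRBYTE_API = "http://airbyte-server:8001/api/v1"
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"

@task(retries=2, retry_delay_seconds=60)
def trigger_airbyte_sync(connection_id: str) -> str:
    """Start one batch sync and return the Airbyte job id."""
    resp = requests.post(
        f"{AIRBYTE_API}/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    resp.raise_for_status()
    return str(resp.json()["job"]["id"])

@flow(name="transfer-flow")
def transfer_flow():
    job_id = trigger_airbyte_sync(CONNECTION_ID)
    print(f"Airbyte sync job started: {job_id}")

if __name__ == "__main__":
    # In production this flow would be deployed with a schedule (e.g. daily);
    # here it is run once for illustration.
    transfer_flow()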

Entities

Stream ID: dawnbreaker2magnus

Deployment

Deployment is triggered by a push to a specific branch.

flowchart LR
    gitHub[GitHub] -- on push --> trigger[GitHub Actions] -- deploy --> cloud_compute[Cloud compute]
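
For illustration only, the deploy step that GitHub Actions runs on the Compute Engine instance might look like the sketch below; the repository path and the docker-compose command are assumptions, not the project's actual deploy script.

import subprocess

# Hypothetical deploy step executed on the Compute Engine instance by the
# GitHub Actions workflow. The repository path and the use of docker-compose
# are assumptions for illustration only.
REPO_DIR = "/opt/streaming-from-data-layer"  # placeholder path

def deploy() -> None:
    """Pull the latest code and restart the containers."""
    subprocess.run(["git", "-C", REPO_DIR, "pull", "--ff-only"], check=True)
    subprocess.run(
        ["docker-compose", "up", "-d", "--build"],
        cwd=REPO_DIR,
        check=True,
    )

if __name__ == "__main__":
    deploy()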

Set up - Service Account

The service account is required to access:

[1] Cloud SQL: Read WAL

[2] Cloud GCS: Upload WAL

[3] BigQuery: Upload log and data

To run the tests against your bucket, make sure to set up a Service Account with the necessary roles (see the verification sketch below):

  • compute.instanceAdmin.v1
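
As a quick check that the Service Account works, the sketch below loads its key file and runs a trivial BigQuery query. The key path and project ID are placeholders, not this project's real values.

from google.cloud import bigquery
from google.oauth2 import service_account

# Assumptions: the key file path and project ID are placeholders for the
# real Service Account created above.
KEY_PATH = "/path/to/service-account.json"
PROJECT_ID = "your-gcp-project"

def check_bigquery_access() -> None:
    """Confirm that the Service Account can reach BigQuery."""
    credentials = service_account.Credentials.from_service_account_file(KEY_PATH)
    client = bigquery.Client(project=PROJECT_ID, credentials=credentials)
    rows = client.query("SELECT 1 AS ok").result()
    print(list(rows))

if __name__ == "__main__":
    check_bigquery_access()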

Set up - Compute Engine

  • Zone: asia-southeast1-b
  • Machine type: e2-medium
  • Image: Debian GNU/Linux 11 (bullseye)
  • RAM: 4 GB
  • vCPU: 2
  • Disk: 30GB

Set up - System Dependencies

Service          Description                                        Note
docker           Docker daemon
docker-compose   Compose multiple services from a Docker Compose file
htop             A cross-platform interactive process viewer

Set up - Cloud SQL for MySQL

Enable binary logging

To enable binary logging for Cloud SQL for MySQL, see Enabling point-in-time recovery.

Create a Datastream user

To create a Datastream user, enter the following MySQL commands:

CREATE USER 'datastream'@'%' IDENTIFIED BY '[YOUR_PASSWORD]';
GRANT REPLICATION SLAVE, SELECT, REPLICATION CLIENT ON *.* TO 'datastream'@'%';
FLUSH PRIVILEGES;
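
To confirm that binary logging and the replication user are set up correctly, a small PyMySQL sketch such as the one below can be run; the host and password are placeholders, and connecting through the Cloud SQL Auth Proxy is an assumption.

import pymysql

# Assumptions: connection details are placeholders for the Cloud SQL instance,
# and the 'datastream' user was created with the grants shown above.
conn = pymysql.connect(
    host="127.0.0.1",          # e.g. via the Cloud SQL Auth Proxy
    user="datastream",
    password="[YOUR_PASSWORD]",
)

with conn.cursor() as cur:
    # Binary logging must be ON for CDC to read the binlog (WAL).
    cur.execute("SHOW VARIABLES LIKE 'log_bin'")
    print(cur.fetchone())       # expected: ('log_bin', 'ON')

    # The replication user should be able to see the current binlog position.
    cur.execute("SHOW MASTER STATUS")
    print(cur.fetchone())

conn.close()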

The ideas behind this container pipeline are:

(a) Separate READ/WRITE load on the target RDBMS by replicating its data into a columnar format (BigQuery).

(b) Apply SCD Type 2 to the target tables: history is kept by adding a new row for each change (see the sketch below).

See the documentation at: https://www.sqlshack.com/implementing-slowly-changing-dimensions-scds-in-data-warehouses/
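
For illustration, one common way to apply SCD Type 2 in BigQuery is to expire the current row and insert a new one whenever a tracked attribute changes. In the sketch below, the dataset, tables, and columns (dwh.dim_customer, dwh.stg_customer, customer_id, name, email, is_current, valid_from, valid_to) are all hypothetical, not this pipeline's real schema.

from google.cloud import bigquery

# Hypothetical names for illustration only: dwh.dim_customer is the SCD2
# dimension table, dwh.stg_customer is the staging snapshot loaded by the
# CDC tool. Authentication is assumed to be configured via the Service Account.
client = bigquery.Client()

# Step 1: close the currently valid row for every key whose attributes changed.
expire_changed_rows = """
UPDATE `dwh.dim_customer` d
SET is_current = FALSE,
    valid_to   = CURRENT_TIMESTAMP()
WHERE d.is_current
  AND EXISTS (
    SELECT 1
    FROM `dwh.stg_customer` s
    WHERE s.customer_id = d.customer_id
      AND (s.name != d.name OR s.email != d.email)
  )
"""

# Step 2: insert a new current row for new keys and for keys expired above.
insert_new_versions = """
INSERT INTO `dwh.dim_customer`
  (customer_id, name, email, is_current, valid_from, valid_to)
SELECT s.customer_id, s.name, s.email, TRUE, CURRENT_TIMESTAMP(), NULL
FROM `dwh.stg_customer` s
LEFT JOIN `dwh.dim_customer` d
  ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL
"""

for statement in (expire_changed_rows, insert_new_versions):
    client.query(statement).result()  # wait for each DML job to finish

Running the expire statement before the insert guarantees that changed keys no longer have a current row, so the insert treats them the same way as brand-new keys.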

Source Reference

https://cloud.google.com/datastream/docs/implementing-datastream-dataflow-analytics

https://cloud.google.com/datastream/docs/configure-your-source-mysql-database

https://cloud.google.com/datastream/docs/sources-mysql#mysqlknownlimitations

The full documentation for the pipeline is available at Pipeline in Confluent.