SAD

Overview

Inno Lake Prep is the preprocessing layer for data from multiple sources, built on a columnar processor.

SAD - System Architecture Design

Tech Stack

Our components for this warehouse:

  • Transformation pipelines built with dbt, including SQL transformations, YAML files for generic and custom tests, documentation, ...

  • Database backend: BigQuery (columnar database)

  • CI/CD: GitHub Actions and Cloud Build

  • Deployment with Cloud Run

  • Tools: Google Cloud CLI

  • The project is packaged as a container, shipped to Cloud Run, and invoked by Cloud Scheduler.

[1] Developer process, adapted from Christophe Oudar

Architecture Progress

dbt inside GCP

[2] Environments:

There are three environments, each tied to a git branch:

  • Local development

  • Staging

  • Production
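
As a rough illustration of that relation, a runner might resolve the dbt target from the checked-out branch. The branch-to-target mapping below is an assumption, not the project's actual configuration.

```python
import subprocess

# Assumed mapping from git branch to dbt target; adjust to the real branching model.
BRANCH_TO_TARGET = {"main": "prod", "staging": "staging"}


def current_target() -> str:
    """Resolve the dbt target from the current git branch; default to local dev."""
    branch = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return BRANCH_TO_TARGET.get(branch, "dev")
```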

[3] Architecture Components

[Internal]

inno-lake-prep

  • Runs a single command from the server [Sieu: FastAPI + execution server]; see the sketch after this list

  • Linked with Go -> transferred to FastAPI + Cloud Tasks

  • Cost-inefficient

  • Overall: 4

  • 1 Cloud Run + 1 Cloud Task

  • Dockerized server to execute commands in the backend

  • Cloud Scheduler ← Workflow → trigger Cloud Run → report

  • Removed the Workflow triggered by Cloud Scheduler
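
Below is a minimal sketch of the FastAPI execution server described above: a single endpoint that runs one dbt command per request, as Cloud Scheduler or Cloud Tasks would invoke it. The endpoint name, payload shape, and targets are illustrative assumptions, not the actual service contract.

```python
# Minimal sketch of the FastAPI execution server (endpoint and payload are assumptions).
import subprocess

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class DbtJob(BaseModel):
    command: str = "run"        # dbt subcommand: run, test, build, ...
    target: str = "staging"     # dbt target mapped to the environment
    full_refresh: bool = False  # full refresh for new Lake deployments


@app.post("/run")
def run_dbt(job: DbtJob) -> dict:
    """Execute one dbt command in the container, returning the tail of its output."""
    cmd = ["dbt", job.command, "--target", job.target]
    if job.full_refresh:
        cmd.append("--full-refresh")
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise HTTPException(status_code=500, detail=result.stdout[-2000:])
    return {"status": "success", "log_tail": result.stdout[-2000:]}
```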


The ETL should be controlled in different contexts:

For a new deployment on the Lake:

Trigger the ETL workflow with a full refresh.

For a different mode on the trading date:

The schedulers invoke the related workflow jobs.
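
With an execution endpoint like the one sketched earlier, the two contexts differ only in the request payload; the service URL and fields below are illustrative.

```python
import requests

SERVICE_URL = "https://<cloud-run-service>/run"  # placeholder URL

# New deployment on the Lake: rebuild all models from scratch.
requests.post(SERVICE_URL, json={"command": "run", "target": "prod", "full_refresh": True})

# Scheduled run on a trading date: incremental refresh only.
requests.post(SERVICE_URL, json={"command": "run", "target": "prod", "full_refresh": False})
```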

The lake folder represents the raw data source components.

An executor for data transformation actions in the cloud analytics warehouse

Serverless with dbt

Most tables are defined in the raw schema.

The YAML declarations are organized per vendor, together with the related global configuration.

Components of each lake:

+ Prefix: `lake_*`, where `*` represents the vendor code

+ Tables are mapped to a lake mostly by that vendor's prefix, as the sketch below illustrates
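
As a small illustration of that convention (the vendor and table names are hypothetical):

```python
def lake_for_table(table_name: str) -> str:
    """Map a raw table to its lake by vendor prefix, e.g. 'acme_orders' -> 'lake_acme'."""
    vendor = table_name.split("_", 1)[0]
    return f"lake_{vendor}"
```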


Output Path

The output path on GCS.

Below is the SAD design for this solution.

GCP-Pipelines-ETL-Morphling

Cloud Run

Name: morphling-target

Cloud Scheduler

Scheduler name: prod_bq2silencer_*_strategy

Note: `*` is the name of the table being transferred from BigQuery to PostgreSQL.

The scheduler jobs above transfer data from BigQuery to PostgreSQL.
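
A sketch of creating one of these scheduler jobs with the google-cloud-scheduler client; the project, region, cron schedule, table name ('orders'), and endpoint are all hypothetical, and the timezone follows footnote [^2].

```python
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = client.common_location_path("my-project", "us-central1")  # placeholders

job = scheduler_v1.Job(
    name=f"{parent}/jobs/prod_bq2silencer_orders_strategy",  # 'orders' is a hypothetical table
    schedule="0 6 * * *",          # cron expression for the transfer
    time_zone="Asia/Ho_Chi_Minh",  # tz database name, see footnote [^2]
    http_target=scheduler_v1.HttpTarget(
        uri="https://<cloud-run-service>/transfer",  # placeholder endpoint
        http_method=scheduler_v1.HttpMethod.POST,
    ),
)
client.create_job(parent=parent, job=job)
```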

Screenshot of workflows:

ETL-Flows-Morphling-Workflow-Example

Screenshot of email:

ETL-Flows-Morphling-Email-Example

Note: `target` is the deployment target of morphling.

The Cloud Run service above streams data from Cloud SQL to BigQuery.
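
A rough sketch of that streaming step, assuming a PostgreSQL source read with SQLAlchemy and BigQuery streaming inserts; the connection string, source table, and destination table are hypothetical.

```python
import sqlalchemy
from google.cloud import bigquery

# Hypothetical Cloud SQL (PostgreSQL) connection string and table names.
engine = sqlalchemy.create_engine("postgresql+pg8000://user:password@10.0.0.1:5432/trading")
bq = bigquery.Client()

with engine.connect() as conn:
    result = conn.execute(sqlalchemy.text("SELECT * FROM strategy"))
    # Row values must be JSON-serializable for the streaming-insert API.
    rows = [dict(row._mapping) for row in result]

errors = bq.insert_rows_json("my-project.raw.strategy", rows)
if errors:
    raise RuntimeError(f"BigQuery streaming insert failed: {errors}")
```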

[^1] Deployment environment: currently set up for production only, so staging must be invoked manually via Cloud Run or Workflows.

[^2] Scheduler configuration: a) Timezone: see https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

[4] Permissions required

For runners: sa-dbt-user@$PROJECT_ID.iam.gserviceaccount.com with roles:

  • roles/bigquery.dataEditor to write to BigQuery

  • roles/bigquery.jobUser to run BigQuery jobs

For Cloud Build: use the default service account

  • Access to Secret Manager

Cloud Run and Volumes

Cloud Run does not yet support volumes, but you can simulate them through the Secret Manager integration for specific use cases. I used it for a Spring Boot deployment and it's really helpful!
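
A sketch of that pattern in this project's terms: prefer the secret "volume" that Cloud Run mounts as a file, and fall back to the Secret Manager API; the mount path and secret name are hypothetical.

```python
from pathlib import Path

from google.cloud import secretmanager

MOUNT_PATH = Path("/secrets/dbt/profiles.yml")  # hypothetical mount path
SECRET_NAME = "projects/my-project/secrets/dbt-profiles/versions/latest"  # hypothetical secret


def load_profiles() -> str:
    """Prefer the file 'volume' mounted by Cloud Run; fall back to the API."""
    if MOUNT_PATH.exists():
        return MOUNT_PATH.read_text()
    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(name=SECRET_NAME)
    return response.payload.data.decode("utf-8")
```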

Source Reference

https://robertsahlin.com/serverless-dbt-on-google-cloud-platform/

https://discourse.getdbt.com/t/triggering-a-dbt-cloud-job-in-your-automated-workflow-with-python/2573

Appendix

Appendix A: Record of Changes

Table: Record of changes

| Version | Date | Author | Description of Change |
| --- | --- | --- | --- |
| 0.1.0 | 06/04/2024 | Bao Truong | Initial documentation |

  1. Build the dockerized application locally. Ref: https://cloud.google.com/run/docs/building/containers#builder