Skip to content

SAD

Overview

The Basement project is the first layer to handle element of database layer of market dataset from providers and transfer to into the data warehouse.

Table of Contents:

SAD - System Architechture Design

Tech Stack

The list of tech stack that used in the project:

Stack Selection
Programming Python, Bash
with [ORM: sqlalchemy], [Test framework: pytest], httpx
ORM sqlalchemy
Database MySQL, BigQuery
Storage Cloud Storage
Message Bus Pubsub
Communication Slack, Email
Schedule Crontab Linux
CICD GitHub Actions, CloudBuild
Code Storage GitHub
Local development Docker, Docker compose
Oschestration Crontab Linux, Prefect
Network OpenVPN

Logical View

Thea project consistent with this flow

flowchart LR
  s[Schedule] -- trigger --> pipeline[Pipelines]
  pipeline -- pull --> provider[Provider]
  provider -- push --> e[Endpoint] -- interactive --> d[Database]
  provider -- push --> l[Log] -- store --> g[GCS]
  provider -- push --> pubsub[PubSub] -- send metadata of job --> bq[BigQuery]
flowchart LR
  subgraph Architecture
    direction BT
    subgraph Basement
        direction LR
        pipeline -- pull --> provider[Provider]
        provider -- push --> e[Endpoint]
        provider -- push --> l[Log]
        provider -- push --> pubsub[PubSub]
    end
    subgraph Orchestration
        direction LR
        scheduler
    end
    scheduler --> pipeline
  end
  e[Endpoint] -- interactive --> id1[(Database)]
  l[Log] -- store --> g[GCS]
  pubsub[PubSub] -- send metadata of job --> bq[BigQuery]

Physical View

flowchart LR

  onprem_vm[Onpremies Server]

  subgraph gcp[Google Cloud Platform]
    gcs[Cloud Stroage]
    sql[Cloud SQL]
    pubsub[Cloud PubSub]
    bq[BigQuery]

    gcs --> bq
    sql --> bq

  end

  onprem_vm <--> gcp

Deployment

Deployment related to trigger when push on specific branch

flowchart LR
    gitHub[GitHub] -- on push --> trigger[GitHub Actions] -- deploy --> onprem[Onprem server]

To connect or interactive with the server, using VPN is required

flowchart LR

subgraph gcp[Google Cloud Platform]
  services
end

subgraph innotech[Innotech]
  direction LR
  vpn_server
  subgraph server_component[Servers Component]
    s2[172.16.5.2]
    s4[172.16.5.4]
  end
end

gcp -- access  --> vpn_server -- verify/reject --> server_component

Physical Server

The following

Type Server
Onpremies Server 172.5.5.2 and 172.5.5.4
Prefect Server 34.124.143.40 on port 4200

Secret Management

Map between Google Secret Manager ↔ GitHub Action

For Github Action, required admin privilegde to change the secret

Component

Configuration for Cloud

Set up - Service Account Permision

[1] Publish message

  • To get the permissions that you need to publish messages to a topic, ask your administrator to grant you the Pub/Sub Publisher (roles/pubsub.publisher) IAM role on topic. For more information about granting roles, see Manage access.

[1] Database

[2] Authentication for machine of runner

[3] Sudo permission/ or at least CRON executor

The service account is required to access:

[1] Cloud SQL: Read/Write/Create

[2] Cloud GCS: Upload log/output

In order to run the tests against your own bucket, make sure to set up a Service Account with all necessary permissions:

storage.objects.get storage.objects.list storage.objects.create storage.objects.update storage.objects.delete

All five permissions listed above are e.g. included in the predefined Cloud Storage IAM Role roles/storage.objectAdmin.

Set up - Account

Account Action
runner_kunkka Runner for execute pipeline, script
runner_oracle Runner for healthcheck, reboot server

Set up - System Dependencies

Services Description
dnf A package manager for RPM-based Linux distributions
epel-release The EPEL repository was created by the EPEL group working at the Fedora Project.
wget Retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely used Internet protocols.
curl Perform requests HTTPS
vim text editor has been the standard editor in all UNIX and Linux-based systems
unzip Zipping and unzipping files is not as simple of a task in Linux
make GNU Make tools controls the generation of executables of a program.
openssl-devel A toolkit for supporting cryptography.
bzip2-devel A high-quality data compressor.
htop A cross-platform interactive process viewer
telnet Local area network to provide a bidirectional interactive text-oriented communication
yq A lightweight and portable command-line YAML, JSON and XML processor

Set up - Service Component

Services Description Note
Python3.9 Python is an interpreted high-level general-purpose programming language. Required >= 3.6
git Source Version Control
docker Docker daemon
gcloud-cli Command Line interactive with Google Cloud resources
docker-compose Compose multiple services from Docker file
yq A lightweight and portable command-line YAML, JSON and XML processor mikefarah/yq
CronD Daemon to run the scheduled jobs in the environment
google-chrome Handle the browser tool in the server Version 106.0.5249.119

Set up - SSH connection

Config the SSH connection for production

Follow GitHub Actions Deployment From zellwk

Schedule Requirement

The demand of this is with any schedule libraries or packages need to satified/built-in or can build

Design Requirement for Schedule

  1. Alway trigger full circle when reboot

  2. Set a log stream file

  3. Has a communication system.

  4. Reboot alerts.

Securities

For our development server, we need to achieve the stage of Secure-Utilize to make develop an environment for our developers.

Required

[1] Postgres Connection String

34.1. Database Connection Control Functions

[2] Create Access [Maintancer Only]

Task

Target

Priority

Note

Root User Protect

Credential need to be frequency update

Change Root User Password

Storage Places of Root/Credential of program system: Database,…

High

System Healths

Turn On/Off when downgrade

Monitor System Access database

Read/Write in specific schema

Whitelist Domain, IP [Database] Database Access Policy

GitHub Action configuration

The introduction to config the secret in GitHub Action

Firewall

Inbound

GitHub Server

Retrieve a list of GitHub's IP addresses from the meta API endpoint

curl -X GET https://api.github.com/meta

Example component

...
  [
    "20.201.28.144/32",
    "20.205.243.164/32",
    "102.133.202.243/32",
    "20.248.137.52/32",
    "20.207.73.86/32",
    "20.27.177.117/32",
    "20.200.245.241/32",
    "20.233.54.55/32"
  ],
  "pages": [
    "192.30.252.153/32",
    "192.30.252.154/32",
    "185.199.108.153/32",
    "185.199.109.153/32",
    "185.199.110.153/32",
    "185.199.111.153/32",
    "2606:50c0:8000::153/128",
    "2606:50c0:8001::153/128",
    "2606:50c0:8002::153/128",
    "2606:50c0:8003::153/128"
  ],
  "importer": [
    "52.23.85.212/32",
    "52.0.228.224/32",
    "52.22.155.48/32"
  ],
  "actions": [
    "4.148.0.0/16",
    "4.149.0.0/18",
    "4.149.64.0/19",
    "4.149.96.0/19",
    "4.149.128.0/17",
    "4.150.0.0/18",
    "4.150.64.0/18",
    "4.150.128.0/18",
    "4.150.192.0/19",
    "4.150.224.0/19",
    "4.151.0.0/16",
    "4.152.0.0/15",
    "4.154.0.0/15",
    "4.156.0.0/15",
  ],
...

Outbound:

scheme:host https:iboard.ssi.com.vn https:finfo-iboard.ssi.com.vn https:fundamental-ssi.fiintrade.vn https:api.vietstock.vn https:vsd.vn https:hnx.vn https:www.hsx.vn https:s.cafef.vn https:streaming-api.hsc.com.vn http:f319.com

Appendix

Appendix A: Record of Changes

Table: Record of changes

Version Date Author Description of Change
0.0.1 01/01/2024 Bao Truong Initation documentation
0.0.2 01/10/2024 Bao Truong Added Component when deploying new server
0.0.3 01/15/2024 Bao Truong Added Pip packages for Python
0.0.4 01/17/2024 Bao Truong Added crond service health check
0.0.5 01/25/2024 Bao Truong Revamp scripts
0.2.0 02/02/2024 Bao Truong Updated layout of SAD
0.2.0 05/17/2024 Bao Truong Updated component from basement

Source Reference

[2] Backup options with detail implement with OpenVPN of Gastkemper at How to use OpenVPN in Github Action workflow