Skip to content

SAD

Overview

Design related SAD for Financial Fundamental, which is a collection of datasets that represented component of financial statements of public companies.

Metadata:

Attribute Information
Provider State Securities Commission of Vietnam - SSC (Government)
Tags fundamental, public-resource
Element Metadata, Report file (Excel, PDF), Statement

System Architechture Design (SAD)

Logical Concept

The logical concept to handle the dataset component. The process trigger to Job with following step:

  • Step 1: Fetch from provider the datasets and file (in raw style: excel, pdf)

  • Step 2: Push/validate metadata and parse into structured metadata and with annotated information

  • Step 3: Store raw file format into bucket

Then will push all element into transformation pipeline and serving output.

flowchart LR

  %% Component
  oschestration[Oschestration]
  provider[Provider]
  job[Job]
  transform_pipeline[Transform Pipeline]
  output[Output :API:, :File:]
  database[Database]
  bucket[Bucket]

  %% Element
  subgraph storage[Storage Engine]
    database
    bucket
  end

  %% Flow
  oschestration -- trigger --> job
  job -- [1] fetch resource (from API) or download raw --> provider
  job -- [2] push metadata, structured output --> database
  job -- [3] store raw file (report, statement) --> bucket
  database --> transform_pipeline
  bucket --> transform_pipeline
  transform_pipeline --> output

Pipeline

Based on the logical concept, the pipeline has been introduced into the following pipeline.

flowchart TB

  %% Component
  crontab[CronTab]
  frequency[Frequency]
  provider[Provider]
  mannual[Input Operation]
  gcs_internal[Internal Bucket]
  gcs_cdn[CDN Bucket]
  bigquery[BigQuery]
  mysql[MySQL]
  dbt[DBT]
  spectrum[Spectrum]
  job[Job]
  workflow[Workflow]

  %% Entities
  fs_metadata[Statement Metadata]
  fs_financial_statement[Statement Dataset]
  fs_statement_report[Statement File @pdf @excel]

  %% Element
  subgraph oschestration[Oschestration]
    direction TB
    crontab
    frequency
  end

  subgraph basement[Basement Layer]
    direction TB
    job
    provider
    mannual
  end

  subgraph storage[Storage Engine]
    direction TB
    gcs_internal
    bigquery
    mysql[MySQL]
  end

  subgraph ratelimit[Rate Limit]
    direction TB
    cloud_task[Cloud Task]
  end

  %% Service
  subgraph submarine[Submarine Service]
    direction LR
    application_submarine[ASGI]
  end

  subgraph processor[Processor]
    direction LR
    workflow -- trigger --> application[ASGI Handler]
  end

  %% Group
  subgraph entity[Memory Entity]
    fs_statement_report
    fs_metadata
    fs_financial_statement
  end

  subgraph transformantion[Transformantion]
    dbt
  end

  subgraph consume[Consume]
    spectrum
    gcs_cdn
  end

  %% Trigger
  crontab -- trigger --> job --> provider
  frequency -- trigger --> mannual

  %% Mapping
  provider -- create --> fs_metadata
  provider -- create --> fs_financial_statement
  provider -- download --> fs_statement_report
  mannual -- input --> fs_metadata
  mannual -- input --> fs_financial_statement
  mannual -- upload --> fs_statement_report

  %% Handler
  fs_metadata --> submarine
  fs_financial_statement  --> ratelimit --> processor -- RESTful API --> submarine
  fs_financial_statement --> gcs_internal
  fs_statement_report -- store --> gcs_internal

  processor -- [1] fetch --> gcs_internal
  mysql -- streaming --> bigquery
  application_submarine --> mysql
  processor  -- direct import --> bigquery
  bigquery --> dbt

  gcs_internal -- rsync --> gcs_cdn
  dbt --> spectrum

For the order of the pipeline, below is more configuration specification on the pipeline.

Detail Service

The table described the detail component

Type Group Name Identity Description
Storage Engine Storage Bucket inno-internal-assets Bucket CDN for internal assets
Consume Storage Bucket inno-public-assets Bucket CDN for public assets
Oschestration Codespace GitHub inno-basement Repository contain the jobs to schedule
Codespace GitHub inno-transflow Repository handle report into spots
Codespace GitHub inno-processor Repository contain services baronyx parse structure
Transformantion Codespace GitHub inno-lake-prep Repository contain transformation
Consume Codespace GitHub inno-spectrum Repository contain client output by Restful API

Oschestration

There are 2 parts:

  • Part 1: Schedule to fetch data in the background (or in the production level)

  • Part 2: Frequency manual input from data officer

Strategy for oschestration:

a) For the first large scan: happened when the first time scanned all the element in the databse

b) For the on basic frequency:

On daily: Fetch day-by-day with the current metadata until all the element is successed

On the missing records: Fetch/Validate the missing dates in the overall dataset to define the missing date and run-back for that date.

Basement Layer

For the job, it's constructed in the GitHub::inno-basement that storage raw dataset.

The base origin URL: Information Disclosure at congbothongtin.ssc.gov.vn come directly from provider SSC.

The screenshot of sample:

Public Company Disclosure Event
Screenshot Public Company Screenshot Disclosure Event
Metadata Event Financial Statement
Screenshot Metadata Event Screenshot Financial Statement

Component:

  • Download disclosure event and assign index

  • Handle the metadata component element

  • Construct the metadat element

For the storage, using Hive pattern: Google Hive partitioned queries

Submarine

  • API cho Financial Spectrum: Financial Statement

Processor

  • Using metadata to put file into GCS with hadooop pattern

  • Using sparkEngine Dataprodc to execute component and textures

flowchart LR

  provider[Provider] -- send --> internal_bucket -- push --> eventract -- send message --> workflow -- handle --> baronyx -- extract --> output

For the internal of component

flowchart LR
  a[List of raw statement] --> b[Parse to Statemenet with detail classification] -- decomposite --> c[Node] -- using complier --> d[List of element]

For the extraction

/*
  Concept
  =======

  1 company (EIN) --> Single Period   ---> n report               ---> m statement
                      (Fiscal Y+Q)        |__  k snapshot_uuid        |___ t statement_uuid
                                          |_ k1
                                          |_ k2 (Latest)          ---> Latest statement
                                          |  ...
                                          |_ k(n)

  But at one time, 1 EIN will reference into latest version.
    */
  • Bucket CDN for download file [Pipeline to signed URL] [Sieu]

Spectrum

  • API tren endpoint: /financial-statement/balance-sheet/<ticker>

Entity Relationship Diagram

Relationship

The ERD related dataset element

flowchart LR

  %% Component
  public_company[Public company] -- reported --> event[Disclousure Event]
  event -- contain --> metadata_origin
  event -- contain --> metadata_records
  event -- contain --> report
  report -- contain --> fa[Fundamental Statement]
  fa -- included --> fa_bs[Statement: Balance Sheet]
  fa -- included --> fa_is[Statement: Income Statement]
  fa -- included --> fa_cf[Statement: Cash Flow]
  fa[Fundamental Component] --> filing[Fundamental Filling]
# 1 report-uuid
# ~
# 3k - 1 snapshot-uuid
# ~
# 2n statement-uuid

Following is the detail schema for the related

erDiagram
  PUBLIC_COMPANY ||--o{ DISCLOSURE_EVENT : allows
  PUBLIC_COMPANY {
      business_licences string "PK, Business Licences"
      company_name text "Company name"
      company_code int "Tax Code of company. Check Lenght(12)"
      trading_platform string "Trading Platform, included HSX, HNX, UPCOM"
      ticker string "The abbr stock traing market"
  }
  METADATA_EVENT ||--o{ DISCLOSURE_EVENT : is
  METADATA_EVENT {
      string driversLicense "The license #"
      string firstName "Only 99 characters are allowed"
      string lastName
      string phone
      int age
  }
  DISCLOSURE_EVENT {
      string carRegistrationNumber
      string driverLicence
  }
  ELEMENT only one to zero or more METADATA_EVENT : makes

Table schema

Schema represented for related relationship

Dimension:

The structured dimension dataset for fundamental

Table: inno_dim_fundamental_report_level

Schema:

Column Description Type Key
id Report Level ID String(4) PK
code Report Level Code String(50)
name Name of report level Text
vietnamese Vietnamese related to of report level Text

Element:

id code name vietnamese
SEPA SEPARATED Separated Đơn lẻ
CONS CONSOLIDATED Consolidated Hợp nhất
PARE PARENT Parent Công ty mẹ

Table: inno_dim_fundamental_statement_type

Schema:

Column Description Type Key
id Statement Type ID String(3) PK
code Statement Type Code String(50)
name Name of statement Text
vietnamese Vietnamese related to of statement Text

Element:

id code name vietnamese
BS BALANCE_SHEET Balance Sheet Bảng cân đối kế toán
IS INCOME_STATEMENT Income Statement Bảng kết quả kinh doanh
CFI CASH_FLOW_DIRECT Cash Flow Direct Bảng lưu chuyển tiền tệ trực tiếp
CFD CASH_FLOW_INDIRECT Cash Flow Indirect Bảng lưu chuyển tiền tệ gián tiếp

Table: inno_dim_fundamental_business_category

Schema:

Column Description Type Key
id Business Category ID String(4) PK
code Business Category Code String(50)
name Name of category Text
vietnamese Vietnamese related to of category Text

Element:

id code name vietnamese
BASI BASIC Basic Join Stock Company CTCP thông thường
BANK BANK Bank Ngân hàng
INSU INSURANCE Insurance Company Bảo hiểm
SECU SECURITY Security Company Chứng khoán
FUND FUND Investment Funds Quỹ đầu tư

Based on the each business category, there are different account chart related to each financial statement.

Table: inno_dim_fundamental_account_chart

Column Description Type Key
business_category_code Business Category Code String(50) PK, FK (inno_dim_fundamental_business_category.code)
statement_type_code Statement Type Code String(50) PK, FK (inno_dim_fundamental_statement_type.code)
account_id Account ID Integer PK
account_level Account level Integer
account_prefix Account prefix String(20)
account_label Account label Text
account_class Account class String(20)

Origin dataset:

Table: ssc_public_company

Name Description Type Mode
table_position Int64 Required (Primary Key)
long_company_name long_company_name Text Nullable
EIN Enterprise Identification Number String(100) Required (Primary Key)
exchange_name exchange_name Text Nullable
ticker ticker String(100) Nullable
authorized_by authorized_by Text Nullable

Table: inno_financial_statement

table_id = "inno_financial_statement" partition_on_field = "period" partition_by = "YEAR"

name type mode description defaultValueExpression
snapshot_uuid STRING REQUIRED Image snapshot UUID of dataset
created_at TIMESTAMP NULLABLE The snapshot timestamp when record has been created CURRENT_TIMESTAMP()
updated_at TIMESTAMP NULLABLE The snapshot timestamp when record has been updated
deleted_at TIMESTAMP NULLABLE The snapshot timestamp when record has been deleted
created_by STRING NULLABLE Name of the identifier pushed record to dataset
updated_by STRING NULLABLE Name of the identifier updated record from dataset
deleted_by STRING NULLABLE Name of the identifier deleted record from dataset
statement_uuid STRING REQUIRED UUID represented for a statement for version control
period DATE REQUIRED Combine the fiscal year to date for partition purpose
ein STRING REQUIRED Enterprise Identifier Number (EIN) of company
vendor_id STRING REQUIRED Vendor ID that published event of statement
event_id STRING REQUIRED Event ID of disclosure event linked metadata of statement
md5_hash_content STRING REQUIRED MD5 content of disclosure event - related to event_id
financial_statement_type STRING REQUIRED Financial statement type.
financial_statement_form STRING REQUIRED Financial statement form.
fiscal_year INT64 REQUIRED Fiscal year
fiscal_quarter INT64 NULLABLE Fiscal quarter
account_label STRING NULLABLE Account label of statement line
account_class STRING NULLABLE Account class of statement line
appendix_location STRING NULLABLE Appendix location for account component
_value INT64 NULLABLE Unit value in number
accumulated_value INT64 NULLABLE Unit value has been accumulated
unit_multiple INT64 NULLABLE Unit scale in number
unit_currency STRING NULLABLE Unit currency type

Code of ['balance sheet', 'income statement', 'cash flow indirect' or 'cash flow direct'] Code of ['YEAR', 'QUARTER']

Mapping: Re-hashed the UUID access report

Operator Component Value Report UUID Statement UUID Node UUID
EIN 0100100294 X X X
+ Form QUARTERLY, ANNUAL X X X
+ Report Level CONS, PARE, SEPA X X X
+ Fiscal Period (Year + Quarter) 202401 X X X
+ Type BS, IS, CFI, CFD X X
+ Account ID (Label + Class) Name + Class X

Based on the account chart, there are mapping on account chart

For normal, usally use CFI

For bank, the bank use CFD

Current, the insurance not has data (BVH), the security company (SSI)

Procedure

For the correction of report

The report has been recognized correction:

Detect Rule:

  • [Rule 1] The regex on node composite.report.descriptive_title with Đính chính component

Example:

Title Detect Recognized
Đính chính Báo cáo tài chính tổng hợp năm 2022 Rule 1 Correction
Đính chính Báo cáo tài chính hợp nhất năm 2022 Rule 1 Correction

After the recognized element:

  • Mapping to the related report then matching correction_report_id

  • Mapping to the related statement then matching correction_statement_id

  • Bind the rule is_correction on metadata section

Reference:

Source Reference

[1] [Original SSC for published events]: [https://congbothongtin.ssc.gov.vn/faces/NewsSearch]

[2] edgar: An R package for the U.S. SEC EDGAR retrieval and parsing of corporate filings

[3] Financial Statement Data Sets. Reference URL

Appendix

Appendix A: Record of Changes

Table: Record of changes

Version Date Author Description of Change
0.1.0 06/04/2024 Bao Truong Initation documentation