SAD¶
Overview¶
Design related SAD for Financial Fundamental, which is a collection of datasets that represented component of financial statements of public companies.
Metadata:
Attribute | Information |
---|---|
Provider | State Securities Commission of Vietnam - SSC (Government) |
Tags | fundamental , public-resource |
Element | Metadata, Report file (Excel, PDF), Statement |
System Architechture Design (SAD)¶
Logical Concept¶
The logical concept to handle the dataset component. The process trigger to Job with following step:
-
Step 1: Fetch from provider the datasets and file (in raw style: excel, pdf)
-
Step 2: Push/validate metadata and parse into structured metadata and with annotated information
-
Step 3: Store raw file format into bucket
Then will push all element into transformation pipeline and serving output.
flowchart LR
%% Component
oschestration[Oschestration]
provider[Provider]
job[Job]
transform_pipeline[Transform Pipeline]
output[Output :API:, :File:]
database[Database]
bucket[Bucket]
%% Element
subgraph storage[Storage Engine]
database
bucket
end
%% Flow
oschestration -- trigger --> job
job -- [1] fetch resource (from API) or download raw --> provider
job -- [2] push metadata, structured output --> database
job -- [3] store raw file (report, statement) --> bucket
database --> transform_pipeline
bucket --> transform_pipeline
transform_pipeline --> output
Pipeline¶
Based on the logical concept, the pipeline has been introduced into the following pipeline.
flowchart TB
%% Component
crontab[CronTab]
frequency[Frequency]
provider[Provider]
mannual[Input Operation]
gcs_internal[Internal Bucket]
gcs_cdn[CDN Bucket]
bigquery[BigQuery]
mysql[MySQL]
dbt[DBT]
spectrum[Spectrum]
job[Job]
workflow[Workflow]
%% Entities
fs_metadata[Statement Metadata]
fs_financial_statement[Statement Dataset]
fs_statement_report[Statement File @pdf @excel]
%% Element
subgraph oschestration[Oschestration]
direction TB
crontab
frequency
end
subgraph basement[Basement Layer]
direction TB
job
provider
mannual
end
subgraph storage[Storage Engine]
direction TB
gcs_internal
bigquery
mysql[MySQL]
end
subgraph ratelimit[Rate Limit]
direction TB
cloud_task[Cloud Task]
end
%% Service
subgraph submarine[Submarine Service]
direction LR
application_submarine[ASGI]
end
subgraph processor[Processor]
direction LR
workflow -- trigger --> application[ASGI Handler]
end
%% Group
subgraph entity[Memory Entity]
fs_statement_report
fs_metadata
fs_financial_statement
end
subgraph transformantion[Transformantion]
dbt
end
subgraph consume[Consume]
spectrum
gcs_cdn
end
%% Trigger
crontab -- trigger --> job --> provider
frequency -- trigger --> mannual
%% Mapping
provider -- create --> fs_metadata
provider -- create --> fs_financial_statement
provider -- download --> fs_statement_report
mannual -- input --> fs_metadata
mannual -- input --> fs_financial_statement
mannual -- upload --> fs_statement_report
%% Handler
fs_metadata --> submarine
fs_financial_statement --> ratelimit --> processor -- RESTful API --> submarine
fs_financial_statement --> gcs_internal
fs_statement_report -- store --> gcs_internal
processor -- [1] fetch --> gcs_internal
mysql -- streaming --> bigquery
application_submarine --> mysql
processor -- direct import --> bigquery
bigquery --> dbt
gcs_internal -- rsync --> gcs_cdn
dbt --> spectrum
For the order of the pipeline, below is more configuration specification on the pipeline.
Detail Service¶
The table described the detail component
Type | Group | Name | Identity | Description |
---|---|---|---|---|
Storage Engine | Storage | Bucket | inno-internal-assets | Bucket CDN for internal assets |
Consume | Storage | Bucket | inno-public-assets | Bucket CDN for public assets |
Oschestration | Codespace | GitHub | inno-basement | Repository contain the jobs to schedule |
Codespace | GitHub | inno-transflow | Repository handle report into spots | |
Codespace | GitHub | inno-processor | Repository contain services baronyx parse structure | |
Transformantion | Codespace | GitHub | inno-lake-prep | Repository contain transformation |
Consume | Codespace | GitHub | inno-spectrum | Repository contain client output by Restful API |
Oschestration¶
There are 2 parts:
-
Part 1: Schedule to fetch data in the background (or in the production level)
-
Part 2: Frequency manual input from data officer
Strategy for oschestration:
a) For the first large scan: happened when the first time scanned all the element in the databse
b) For the on basic frequency:
On daily: Fetch day-by-day with the current metadata until all the element is successed
On the missing records: Fetch/Validate the missing dates in the overall dataset to define the missing date and run-back for that date.
Basement Layer¶
For the job, it's constructed in the GitHub::inno-basement
that storage raw dataset.
The base origin URL: Information Disclosure at congbothongtin.ssc.gov.vn come directly from provider SSC.
The screenshot of sample:
Public Company | Disclosure Event |
---|---|
![]() | ![]() |
Metadata Event | Financial Statement |
---|---|
![]() | ![]() |
Component:
-
Download disclosure event and assign index
-
Handle the metadata component element
-
Construct the metadat element
For the storage, using Hive pattern: Google Hive partitioned queries
Submarine¶
- API cho Financial Spectrum: Financial Statement
Processor¶
-
Using metadata to put file into GCS with hadooop pattern
-
Using sparkEngine Dataprodc to execute component and textures
flowchart LR
provider[Provider] -- send --> internal_bucket -- push --> eventract -- send message --> workflow -- handle --> baronyx -- extract --> output
For the internal of component
flowchart LR
a[List of raw statement] --> b[Parse to Statemenet with detail classification] -- decomposite --> c[Node] -- using complier --> d[List of element]
For the extraction
/*
Concept
=======
1 company (EIN) --> Single Period ---> n report ---> m statement
(Fiscal Y+Q) |__ k snapshot_uuid |___ t statement_uuid
|_ k1
|_ k2 (Latest) ---> Latest statement
| ...
|_ k(n)
But at one time, 1 EIN will reference into latest version.
*/
- Bucket CDN for download file [Pipeline to signed URL] [Sieu]
Spectrum¶
- API tren endpoint:
/financial-statement/balance-sheet/<ticker>
Entity Relationship Diagram¶
Relationship¶
The ERD related dataset element
flowchart LR
%% Component
public_company[Public company] -- reported --> event[Disclousure Event]
event -- contain --> metadata_origin
event -- contain --> metadata_records
event -- contain --> report
report -- contain --> fa[Fundamental Statement]
fa -- included --> fa_bs[Statement: Balance Sheet]
fa -- included --> fa_is[Statement: Income Statement]
fa -- included --> fa_cf[Statement: Cash Flow]
fa[Fundamental Component] --> filing[Fundamental Filling]
Following is the detail schema for the related
erDiagram
PUBLIC_COMPANY ||--o{ DISCLOSURE_EVENT : allows
PUBLIC_COMPANY {
business_licences string "PK, Business Licences"
company_name text "Company name"
company_code int "Tax Code of company. Check Lenght(12)"
trading_platform string "Trading Platform, included HSX, HNX, UPCOM"
ticker string "The abbr stock traing market"
}
METADATA_EVENT ||--o{ DISCLOSURE_EVENT : is
METADATA_EVENT {
string driversLicense "The license #"
string firstName "Only 99 characters are allowed"
string lastName
string phone
int age
}
DISCLOSURE_EVENT {
string carRegistrationNumber
string driverLicence
}
ELEMENT only one to zero or more METADATA_EVENT : makes
Table schema¶
Schema represented for related relationship
Dimension:
The structured dimension dataset for fundamental
Table:
inno_dim_fundamental_report_level
Schema:
Column | Description | Type | Key |
---|---|---|---|
id | Report Level ID | String(4) | PK |
code | Report Level Code | String(50) | |
name | Name of report level | Text | |
vietnamese | Vietnamese related to of report level | Text |
Element:
id | code | name | vietnamese |
---|---|---|---|
SEPA | SEPARATED | Separated | Đơn lẻ |
CONS | CONSOLIDATED | Consolidated | Hợp nhất |
PARE | PARENT | Parent | Công ty mẹ |
Table:
inno_dim_fundamental_statement_type
Schema:
Column | Description | Type | Key |
---|---|---|---|
id | Statement Type ID | String(3) | PK |
code | Statement Type Code | String(50) | |
name | Name of statement | Text | |
vietnamese | Vietnamese related to of statement | Text |
Element:
id | code | name | vietnamese |
---|---|---|---|
BS | BALANCE_SHEET | Balance Sheet | Bảng cân đối kế toán |
IS | INCOME_STATEMENT | Income Statement | Bảng kết quả kinh doanh |
CFI | CASH_FLOW_DIRECT | Cash Flow Direct | Bảng lưu chuyển tiền tệ trực tiếp |
CFD | CASH_FLOW_INDIRECT | Cash Flow Indirect | Bảng lưu chuyển tiền tệ gián tiếp |
Table:
inno_dim_fundamental_business_category
Schema:
Column | Description | Type | Key |
---|---|---|---|
id | Business Category ID | String(4) | PK |
code | Business Category Code | String(50) | |
name | Name of category | Text | |
vietnamese | Vietnamese related to of category | Text |
Element:
id | code | name | vietnamese |
---|---|---|---|
BASI | BASIC | Basic Join Stock Company | CTCP thông thường |
BANK | BANK | Bank | Ngân hàng |
INSU | INSURANCE | Insurance Company | Bảo hiểm |
SECU | SECURITY | Security Company | Chứng khoán |
FUND | FUND | Investment Funds | Quỹ đầu tư |
Based on the each business category, there are different account chart related to each financial statement.
Table:
inno_dim_fundamental_account_chart
Column | Description | Type | Key |
---|---|---|---|
business_category_code | Business Category Code | String(50) | PK, FK (inno_dim_fundamental_business_category.code) |
statement_type_code | Statement Type Code | String(50) | PK, FK (inno_dim_fundamental_statement_type.code) |
account_id | Account ID | Integer | PK |
account_level | Account level | Integer | |
account_prefix | Account prefix | String(20) | |
account_label | Account label | Text | |
account_class | Account class | String(20) |
Origin dataset:
Table: ssc_public_company
Name | Description | Type | Mode |
---|---|---|---|
table_position | Int64 | Required (Primary Key) | |
long_company_name | long_company_name | Text | Nullable |
EIN | Enterprise Identification Number | String(100) | Required (Primary Key) |
exchange_name | exchange_name | Text | Nullable |
ticker | ticker | String(100) | Nullable |
authorized_by | authorized_by | Text | Nullable |
Table: inno_financial_statement
table_id = "inno_financial_statement" partition_on_field = "period" partition_by = "YEAR"
name | type | mode | description | defaultValueExpression |
---|---|---|---|---|
snapshot_uuid | STRING | REQUIRED | Image snapshot UUID of dataset | |
created_at | TIMESTAMP | NULLABLE | The snapshot timestamp when record has been created | CURRENT_TIMESTAMP() |
updated_at | TIMESTAMP | NULLABLE | The snapshot timestamp when record has been updated | |
deleted_at | TIMESTAMP | NULLABLE | The snapshot timestamp when record has been deleted | |
created_by | STRING | NULLABLE | Name of the identifier pushed record to dataset | |
updated_by | STRING | NULLABLE | Name of the identifier updated record from dataset | |
deleted_by | STRING | NULLABLE | Name of the identifier deleted record from dataset | |
statement_uuid | STRING | REQUIRED | UUID represented for a statement for version control | |
period | DATE | REQUIRED | Combine the fiscal year to date for partition purpose | |
ein | STRING | REQUIRED | Enterprise Identifier Number (EIN) of company | |
vendor_id | STRING | REQUIRED | Vendor ID that published event of statement | |
event_id | STRING | REQUIRED | Event ID of disclosure event linked metadata of statement | |
md5_hash_content | STRING | REQUIRED | MD5 content of disclosure event - related to event_id | |
financial_statement_type | STRING | REQUIRED | Financial statement type. | |
financial_statement_form | STRING | REQUIRED | Financial statement form. | |
fiscal_year | INT64 | REQUIRED | Fiscal year | |
fiscal_quarter | INT64 | NULLABLE | Fiscal quarter | |
account_label | STRING | NULLABLE | Account label of statement line | |
account_class | STRING | NULLABLE | Account class of statement line | |
appendix_location | STRING | NULLABLE | Appendix location for account component | |
_value | INT64 | NULLABLE | Unit value in number | |
accumulated_value | INT64 | NULLABLE | Unit value has been accumulated | |
unit_multiple | INT64 | NULLABLE | Unit scale in number | |
unit_currency | STRING | NULLABLE | Unit currency type |
Code of ['balance sheet', 'income statement', 'cash flow indirect' or 'cash flow direct'] Code of ['YEAR', 'QUARTER']
Mapping: Re-hashed the UUID access report
Operator | Component | Value | Report UUID | Statement UUID | Node UUID |
---|---|---|---|---|---|
EIN | 0100100294 | X | X | X | |
+ | Form | QUARTERLY, ANNUAL | X | X | X |
+ | Report Level | CONS, PARE, SEPA | X | X | X |
+ | Fiscal Period (Year + Quarter) | 202401 | X | X | X |
+ | Type | BS, IS, CFI, CFD | X | X | |
+ | Account ID (Label + Class) | Name + Class | X |
Based on the account chart, there are mapping on account chart
For normal, usally use CFI
For bank, the bank use CFD
Current, the insurance not has data (BVH), the security company (SSI)
Procedure¶
For the correction of report¶
The report has been recognized correction:
Detect Rule:
- [Rule 1] The regex on node
composite.report.descriptive_title
withĐính chính
component
Example:
Title | Detect | Recognized |
---|---|---|
Đính chính Báo cáo tài chính tổng hợp năm 2022 | Rule 1 | Correction |
Đính chính Báo cáo tài chính hợp nhất năm 2022 | Rule 1 | Correction |
After the recognized element:
-
Mapping to the related report then matching
correction_report_id
-
Mapping to the related statement then matching
correction_statement_id
-
Bind the rule
is_correction
onmetadata
section
Reference:
-
Term from SEC: SEC data term
Source Reference¶
[1] [Original SSC for published events]: [https://congbothongtin.ssc.gov.vn/faces/NewsSearch]
[2] edgar: An R package for the U.S. SEC EDGAR retrieval and parsing of corporate filings
[3] Financial Statement Data Sets. Reference URL
Appendix¶
Appendix A: Record of Changes¶
Table: Record of changes
Version | Date | Author | Description of Change |
---|---|---|---|
0.1.0 | 06/04/2024 | Bao Truong | Initation documentation |