Introduction¶
Overview¶
Introduction of overview
Features:
- Search by natural language for Vietnam
- Transform by de-dup model
- Basic sentiment
- Topic classification
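The de-dup step is not described further in this page. As a minimal sketch, assuming duplicates can be detected by comparing normalized headlines (the actual component may well use a trained model instead), exact-match de-duplication could look like the following. The function names `normalize_headline` and `deduplicate` are hypothetical.

```python
import hashlib
import re
import unicodedata


def normalize_headline(headline: str) -> str:
    """Apply Unicode NFC, lowercase, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFC", headline).lower()
    # \w is Unicode-aware in Python 3, so accented Vietnamese letters are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def deduplicate(headlines: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized headline, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for headline in headlines:
        key = hashlib.sha1(normalize_headline(headline).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(headline)
    return unique
```

This only catches headlines that are identical after normalization; near-duplicates with reworded titles would need a similarity measure or a model.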
SAD - System architecture design¶
Logical View¶
News supplied by multiple providers is first fetched and stored at the internal level.
It is then periodically recognized in the staging environment, which is the centralized news portal where each item carries an ID registered by Innotech.
The staging environment standardizes each item, annotates metadata, links resources, and filters items according to rules on news protection.
Finally, the news is served through the consume layer.
flowchart LR
%% Component
subgraph source_element[Source Element]
direction TB
source_1
source_2
source_dot[...]
souce_n
end
subgraph staging[Centralized Component]
direction TB
internal
end
subgraph consume[Consume ]
direction TB
restful[RestfulAPI - Spectrum]
cdn[Bucket CDN]
end
%% Flow
source_1 & source_2 & source_dot & souce_n -- centralized --> internal
internal --> restful
internal --> cdn
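The staging flow described in this section (standardize, annotate, filter) can be sketched as a small pipeline. This is an illustration under assumptions, not the actual implementation: the class names, the `news-000001` ID format, and the blocked-provider rule are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RawArticle:
    """An article as fetched from a provider (hypothetical shape)."""
    provider: str
    url: str
    headline: str


@dataclass
class StagedArticle:
    """An article after staging: standardized and annotated."""
    news_id: str
    provider: str
    url: str
    headline: str
    metadata: dict = field(default_factory=dict)


def standardize(raw: RawArticle, sequence: int) -> StagedArticle:
    """Trim fields and assign an ID (the ID format is an assumption)."""
    return StagedArticle(
        news_id=f"news-{sequence:06d}",
        provider=raw.provider.strip(),
        url=raw.url.strip(),
        headline=" ".join(raw.headline.split()),
    )


def annotate(article: StagedArticle) -> StagedArticle:
    """Attach metadata; here only the staging timestamp in UTC."""
    article.metadata["staged_at"] = datetime.now(timezone.utc).isoformat()
    return article


def is_allowed(article: StagedArticle, blocked_providers: set[str]) -> bool:
    """Placeholder filter rule: drop articles from blocked providers."""
    return article.provider not in blocked_providers


def stage(raws: list[RawArticle], blocked_providers: set[str]) -> list[StagedArticle]:
    """Run standardize -> annotate -> filter over a batch of fetched articles."""
    staged = (annotate(standardize(raw, i)) for i, raw in enumerate(raws, start=1))
    return [article for article in staged if is_allowed(article, blocked_providers)]
```

The output of `stage` is what the consume side (the REST API and the CDN bucket in the diagram) would read from.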
The elements related to each source:
URL | Provider | Provider Type | Region |
---|---|---|---|
https://vietstock.vn/ | Vietstock | Published News | Vietnam |
https://vir.com.vn/ | Vietnam Investment Review | Published News | Vietnam |
https://cafebiz.vn/ | Cafebiz | Published News | Vietnam |
https://tinnhanhchungkhoan.vn/ | tinnhanhchungkhoan | Published News | Vietnam |
http://www.taichinhdientu.vn/ | taichinhdientu | Published News | Vietnam |
https://tapchitaichinh.vn/ | tapchitaichinh | Published News | Vietnam |
https://vneconomy.vn/ | Vneconomy | Published News | Vietnam |
https://nangluongvietnam.vn/ | Vietnam Energy | Association | Vietnam |
https://www.hsx.vn/ | HOSE | Association | Vietnam |
https://www.hnx.vn/vi-vn/ | HNX | Association | Vietnam |
https://www.economy.com/economicview/ | Economy | Published News | Global |
As of June 2024:
URL | Provider | Provider Type | Region |
---|---|---|---|
https://theleader.vn/ | The Leader | Published News | Vietnam |
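The provider tables map naturally onto a small typed registry. The sketch below is one possible representation, using a few rows from the tables; the `Provider` and `ProviderType` names and the `by_region` helper are hypothetical and not taken from the codebase.

```python
from dataclasses import dataclass
from enum import Enum


class ProviderType(str, Enum):
    """The two provider types that appear in the tables."""
    PUBLISHED_NEWS = "Published News"
    ASSOCIATION = "Association"


@dataclass(frozen=True)
class Provider:
    url: str
    name: str
    provider_type: ProviderType
    region: str


# A subset of the rows listed in the tables.
PROVIDERS = [
    Provider("https://vietstock.vn/", "Vietstock", ProviderType.PUBLISHED_NEWS, "Vietnam"),
    Provider("https://www.hsx.vn/", "HOSE", ProviderType.ASSOCIATION, "Vietnam"),
    Provider("https://www.economy.com/economicview/", "Economy", ProviderType.PUBLISHED_NEWS, "Global"),
]


def by_region(providers: list[Provider], region: str) -> list[Provider]:
    """Return the providers registered for the given region, in listing order."""
    return [p for p in providers if p.region == region]
```

For example, `by_region(PROVIDERS, "Vietnam")` returns the Vietstock and HOSE entries.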
The schema structure¶
Entity Relationship Direction¶
Table:
inno_news_provider
Field | Type | Description |
---|---|---|
ticker | str | The security symbol |
date | str | The trade date |
open | float | The open price |
high | float | The high price |
low | float | The low price |
close | float | The close price |
volume | bigint | The total trade volume (in security) |
Table:
inno_news
Schema:
Field | Type | Description |
---|---|---|
ticker | str | The security symbol |
date | str | The trade date |
open | float | The open price |
high | float | The high price |
low | float | The low price |
close | float | The close price |
volume | bigint | The total trade volume (in security) |
topics
timestamp
web-url
web-title
web-header
publisher
published-time
alias
url
headline
- category: category in which the article was published.
- headline: the headline of the news article.
- authors: list of authors who contributed to the article.
- link: link to the original news article.
- short_description: abstract of the news article.
- date: publication date of the article.
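The six fields described here can be captured in a typed record. The sketch below assumes each article arrives as one JSON object, that `date` is an ISO `YYYY-MM-DD` string, and that `authors` is a single comma-separated string; all three are assumptions about the input, and `Article` and `parse_article` are hypothetical names.

```python
import json
from dataclasses import dataclass
from datetime import date as _date


@dataclass
class Article:
    category: str
    headline: str
    authors: list[str]
    link: str
    short_description: str
    date: _date


def parse_article(line: str) -> Article:
    """Parse one JSON-encoded article record into an Article."""
    record = json.loads(line)
    # Assumption: authors is a comma-separated string in the raw record.
    authors = [name.strip() for name in record["authors"].split(",") if name.strip()]
    return Article(
        category=record["category"],
        headline=record["headline"],
        authors=authors,
        link=record["link"],
        short_description=record["short_description"],
        date=_date.fromisoformat(record["date"]),
    )
```

Importing `date` under the alias `_date` avoids a clash with the field named `date`, the same approach the response models in this page take.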
(1) List of market categories:
Category | Code |
---|---|
Alternative Trading System | ATSS |
Approved Publication Arrangement | APPA |
Approved Reporting Mechanism | ARMS |
Consolidated Tape Provider | CTPS |
Crypto Asset Services Provider | CASP |
Designated Contract Market | DCMS |
Inter-Dealer Quotation System | IDQS |
Multilateral Trading Facility | MLTF |
Not Specified | NSPD |
Organised Trading Facility | OTFS |
Other | OTHR |
Recognised Market Operator | RMOS |
Regulated Market | RMKT |
Swap Execution Facility | SEFS |
Systematic Internaliser | SINT |
Trade Reporting Facility | TRFS |
Table:
inno_dim_news_category
Schema:
Field | Type | Description |
---|---|---|
ticker | str | The security symbol |
date | str | The trade date |
open | float | The open price |
high | float | The high price |
low | float | The low price |
close | float | The close price |
volume | bigint | The total trade volume (in security) |
The category:
POLITICS: 35602
WELLNESS: 17945
ENTERTAINMENT: 17362
TRAVEL: 9900
STYLE & BEAUTY: 9814
PARENTING: 8791
HEALTHY LIVING: 6694
QUEER VOICES: 6347
FOOD & DRINK: 6340
BUSINESS: 5992
COMEDY: 5400
SPORTS: 5077
BLACK VOICES: 4583
HOME & LIVING: 4320
PARENTS: 3955
#!/bin/python3
# Global
from datetime import date as _date
from typing import Optional

# External
from pydantic import ConfigDict, BaseModel, Field

# Application
from app.common import SortQueryContext
from app.schema.response import (
    BaseResultOutputResponseModel,
)


class NewsRight(BaseModel):
    # Every field is optional and nullable in the generated JSON schema.
    ticker: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    event_id: Optional[int] = Field(default=None, json_schema_extra={"nullable": True})
    event_type: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    date_ex: Optional[_date] = Field(default=None, json_schema_extra={"nullable": True})
    date_declaration: Optional[_date] = Field(default=None, json_schema_extra={"nullable": True})
    date_record: Optional[_date] = Field(default=None, json_schema_extra={"nullable": True})
    date_payable: Optional[_date] = Field(default=None, json_schema_extra={"nullable": True})
    title: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    description: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    exchange: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})

    # Allow validation from objects that expose these fields as attributes.
    model_config = ConfigDict(from_attributes=True)

    @classmethod
    def get_order_context(cls) -> SortQueryContext:
        # Default sort order and the fields a query may sort by.
        return SortQueryContext(
            default=["ticker", "date_ex"],
            orderable_field=[
                {"name": "ticker"},
                {"name": "date_ex"},
                {"name": "date_declaration"},
                {"name": "date_record"},
                {"name": "date_payable"},
            ]
        )


class NewsRightResponse(BaseResultOutputResponseModel):
    data: list[NewsRight] = Field(default_factory=list)


class NewsTicker(BaseModel):
    date: Optional[_date] = Field(default=None, json_schema_extra={"nullable": True})
    ticker: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    article: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    description: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    source: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    publish_metadata: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})

    # Allow validation from objects that expose these fields as attributes.
    model_config = ConfigDict(from_attributes=True)

    @classmethod
    def get_order_context(cls) -> SortQueryContext:
        # Default sort order and the fields a query may sort by.
        return SortQueryContext(
            default=["ticker", "date"],
            orderable_field=[
                {"name": "ticker"},
                {"name": "date"},
            ]
        )


class NewsTickerResponse(BaseResultOutputResponseModel):
    data: list[NewsTicker] = Field(default_factory=list)
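The models above return a `SortQueryContext` from `app.common`, whose implementation is not shown in this page. To illustrate how such a context could be consumed, here is a hypothetical stand-in that only mirrors the two constructor arguments used above. The `resolve` method and the leading `-` convention for descending order are assumptions, not the real API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SortContextSketch:
    """Hypothetical stand-in for app.common.SortQueryContext; the real class may differ."""
    default: list[str]
    orderable_field: list[dict]

    def resolve(self, requested: Optional[list[str]]) -> list[str]:
        """Return the requested sort fields, or the defaults when none are given.

        A leading "-" marks descending order (an assumed convention).
        Raises ValueError for a field that is not orderable.
        """
        if not requested:
            return list(self.default)
        allowed = {entry["name"] for entry in self.orderable_field}
        for item in requested:
            if item.lstrip("-") not in allowed:
                raise ValueError(f"cannot order by {item!r}")
        return list(requested)
```

With the `NewsTicker` settings (`default=["ticker", "date"]`), a request without sort fields falls back to the defaults, while a request for `source` is rejected because it is not listed as orderable.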
TODO¶
- RSS parser. Example: www.nguoiduatin.vn RSS. Consider a Python RSS parser
- Stock article title sentiment-based classification using PhoBERT
- Update the documentation for the exchange listener
- [HSX] Add news into a single component
- Build a concept for fact checking and integrate it with operations. Example: PolitiFact Fact Check dataset
- Integrate with event components such as the VSD resource
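For the RSS parser item, a minimal sketch using only the standard library is shown below. It assumes a well-formed RSS 2.0 feed with `<item>` elements carrying `title`, `link`, and `pubDate`; the `parse_rss` name and the sample feed are illustrative, and the real feeds may need extra handling.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text: str) -> list[dict]:
    """Extract title, link, and pubDate from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


# Illustrative feed content, not taken from a real provider.
SAMPLE = """<rss version="2.0"><channel><title>Sample feed</title>
<item><title> First story </title><link>https://example.com/1</link>
<pubDate>Mon, 03 Jun 2024 08:00:00 +0700</pubDate></item>
</channel></rss>"""

if __name__ == "__main__":
    for entry in parse_rss(SAMPLE):
        print(entry["published"], entry["title"], entry["link"])
```

`xml.etree.ElementTree` is not hardened against maliciously constructed XML, so feeds from untrusted sources may call for a dedicated library such as `feedparser`, which also tolerates malformed feeds.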
[1] https://www.vinai.io/release-text-translation-model-of-vinai-translate/
[2] https://www.kaggle.com/datasets/rmisra/news-category-dataset
[3] https://www.johnsnowlabs.com/sentiment-analysis-with-spark-nlp-without-machine-learning/