
Introduction

Overview

Introduction of overview

Features:

  • Natural-language search for Vietnam

  • Transformation with a de-duplication model

  • Basic sentiment

  • Topic classification

  • Fact Check Dataset

SAD - System Architecture Design

Logical View

News supplied by multiple providers is first fetched and stored at the internal level.

It is then periodically recognized in the staging environment, which is the centralized news portal where each item carries an ID registered by Innotech.

The staging environment standardizes the news, annotates metadata, links resources, and filters items according to a set of news-protection rules.

Finally, the news is served through the consume component:

flowchart LR

  %% Component
  subgraph source_element[Source Element]
    direction TB
    source_1
    source_2
    source_dot[...]
    source_n
  end

  subgraph staging[Centralized Component]
    direction TB
    internal
  end

  subgraph consume[Consume]
    direction TB
    restful[RESTful API - Spectrum]
    cdn[Bucket CDN]
  end

  %% Flow
  source_1 & source_2 & source_dot & source_n -- centralized --> internal
  internal --> restful
  internal --> cdn
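The staging step described above (standardize, annotate metadata, filter by rules) can be sketched roughly as follows. This is a minimal illustration and not the actual Innotech implementation: the `RawNews` and `StagedNews` shapes, the `inno-000001` ID format, the `word_count` metadata key, and the rule signature are all assumptions made for the example. Resource linking is left out because the document does not describe how it works.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass
class RawNews:
    """A news item as fetched from a provider (hypothetical shape)."""
    source: str
    url: str
    title: str
    body: str


@dataclass
class StagedNews:
    """A standardized item held in the centralized staging component."""
    news_id: str
    source: str
    url: str
    title: str
    body: str
    metadata: dict = field(default_factory=dict)


def standardize(raw: RawNews, news_id: str) -> StagedNews:
    # Normalize casing and whitespace so items from different providers compare cleanly.
    return StagedNews(
        news_id=news_id,
        source=raw.source.strip().lower(),
        url=raw.url.strip(),
        title=" ".join(raw.title.split()),
        body=raw.body.strip(),
    )


def annotate(item: StagedNews) -> StagedNews:
    # Attach simple metadata; the real annotation set is not specified in this document.
    item.metadata["staged_at"] = datetime.now(timezone.utc).isoformat()
    item.metadata["word_count"] = len(item.body.split())
    return item


def stage(
    raw_items: Iterable[RawNews],
    rules: Iterable[Callable[[StagedNews], bool]],
) -> list[StagedNews]:
    """Standardize and annotate every item, keeping only those that pass all rules."""
    rules = list(rules)
    staged = []
    for index, raw in enumerate(raw_items, start=1):
        # The ID format is an assumption; the document only says IDs are registered.
        item = annotate(standardize(raw, news_id=f"inno-{index:06d}"))
        if all(rule(item) for rule in rules):
            staged.append(item)
    return staged
```

A rule here is any callable that returns `True` to keep an item, for example `lambda item: item.metadata["word_count"] > 0` to drop empty articles.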

The elements related to each source:

| URL | Provider | Provider Type | Region |
| --- | --- | --- | --- |
| https://vietstock.vn/ | Vietstock | Published News | Vietnam |
| https://vir.com.vn/ | Vietnam Investment Review | Published News | Vietnam |
| https://cafebiz.vn/ | Cafebiz | Published News | Vietnam |
| https://tinnhanhchungkhoan.vn/ | tinnhanhchungkhoan | Published News | Vietnam |
| http://www.taichinhdientu.vn/ | taichinhdientu | Published News | Vietnam |
| https://tapchitaichinh.vn/ | tapchitaichinh | Published News | Vietnam |
| https://vneconomy.vn/ | Vneconomy | Published News | Vietnam |
| https://nangluongvietnam.vn/ | Vietnam Energy | Association | Vietnam |
| https://www.hsx.vn/ | HOSE | Association | Vietnam |
| https://www.hnx.vn/vi-vn/ | HNX | Association | Vietnam |
| https://www.economy.com/economicview/ | Economy | Published News | Global |

As of June 2024:

| URL | Provider | Provider Type | Region |
| --- | --- | --- | --- |
| https://theleader.vn/ | The Leader | Published News | Vietnam |
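One way to use the provider list in code is a small registry keyed by host name, so that a fetched article URL can be mapped back to its provider. The sketch below is illustrative only: the `Provider` class, `host_of`, and `provider_for` are hypothetical names, and only a few rows of the list are reproduced.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Provider:
    url: str
    name: str
    provider_type: str
    region: str


# A subset of the providers listed above; extend with the remaining rows as needed.
PROVIDERS = [
    Provider("https://vietstock.vn/", "Vietstock", "Published News", "Vietnam"),
    Provider("https://www.hsx.vn/", "HOSE", "Association", "Vietnam"),
    Provider("https://www.economy.com/economicview/", "Economy", "Published News", "Global"),
    Provider("https://theleader.vn/", "The Leader", "Published News", "Vietnam"),
]


def host_of(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


BY_HOST = {host_of(p.url): p for p in PROVIDERS}


def provider_for(article_url: str) -> Optional[Provider]:
    """Look up the provider that owns an article URL, or None if it is unknown."""
    return BY_HOST.get(host_of(article_url))
```

Matching on the host rather than the full URL keeps the lookup stable when a provider changes its path layout.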

The schema structure

Entity Relationship Direction

Table: inno_news_provider

| Field | Type | Description |
| --- | --- | --- |
| ticker | str | The security symbol |
| date | str | The trade date |
| open | float | The open price |
| high | float | The high price |
| low | float | The low price |
| close | float | The close price |
| volume | bigint | The total trade volume (in security) |

Table: inno_news

Schema:

| Field | Type | Description |
| --- | --- | --- |
| ticker | str | The security symbol |
| date | str | The trade date |
| open | float | The open price |
| high | float | The high price |
| low | float | The low price |
| close | float | The close price |
| volume | bigint | The total trade volume (in security) |

topics

timestamp

web-url

web-title

web-header

publisher

published-time

alias

url

headline

  • category: the category in which the article was published.

  • headline: the headline of the news article.

  • authors: the list of authors who contributed to the article.

  • link: the link to the original news article.

  • short_description: the abstract of the news article.

  • date: the publication date of the article.
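The six fields described above map naturally onto a small record type. The sketch below uses a standard-library dataclass and makes two assumptions that the document does not state: that `authors` arrives as a comma-separated string and that `date` arrives in ISO format (`YYYY-MM-DD`). `NewsArticle` and `parse_article` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date as _date


@dataclass
class NewsArticle:
    category: str
    headline: str
    authors: list[str]
    link: str
    short_description: str
    date: _date


def parse_article(record: dict) -> NewsArticle:
    """Build a NewsArticle from a raw dictionary (assumed input shape)."""
    # Assumption: authors are delivered as one comma-separated string.
    authors = [a.strip() for a in record.get("authors", "").split(",") if a.strip()]
    return NewsArticle(
        category=record["category"].strip().upper(),
        headline=record["headline"].strip(),
        authors=authors,
        link=record["link"],
        short_description=record.get("short_description", "").strip(),
        # Assumption: dates are ISO formatted, e.g. "2024-06-01".
        date=_date.fromisoformat(record["date"]),
    )
```

The `date as _date` alias follows the same convention as the schema module later in this page, which avoids a clash between the field name and the type.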

(1) List of market categories:

| CATEGORY | CODE |
| --- | --- |
| Alternative Trading System | ATSS |
| Approved Publication Arrangement | APPA |
| Approved Reporting Mechanism | ARMS |
| Consolidated Tape Provider | CTPS |
| Crypto Asset Services Provider | CASP |
| Designated Contract Market | DCMS |
| Inter-Dealer Quotation System | IDQS |
| Multilateral Trading Facility | MLTF |
| Not Specified | NSPD |
| Organised Trading Facility | OTFS |
| Other | OTHR |
| Recognised Market Operator | RMOS |
| Regulated Market | RMKT |
| Swap Execution Facility | SEFS |
| Systematic Internaliser | SINT |
| Trade Reporting Facility | TRFS |
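The market category codes listed above form a fixed set, so an enumeration is a reasonable way to carry them in code. The following is a sketch: `MarketCategory` and `category_name` are hypothetical names, and falling back to "Not Specified" for unknown codes is a design assumption rather than documented behavior.

```python
from enum import Enum


class MarketCategory(Enum):
    """Market category codes and their display names, as listed above."""
    ATSS = "Alternative Trading System"
    APPA = "Approved Publication Arrangement"
    ARMS = "Approved Reporting Mechanism"
    CTPS = "Consolidated Tape Provider"
    CASP = "Crypto Asset Services Provider"
    DCMS = "Designated Contract Market"
    IDQS = "Inter-Dealer Quotation System"
    MLTF = "Multilateral Trading Facility"
    NSPD = "Not Specified"
    OTFS = "Organised Trading Facility"
    OTHR = "Other"
    RMOS = "Recognised Market Operator"
    RMKT = "Regulated Market"
    SEFS = "Swap Execution Facility"
    SINT = "Systematic Internaliser"
    TRFS = "Trade Reporting Facility"


def category_name(code: str) -> str:
    """Return the display name for a code, treating unknown codes as 'Not Specified'."""
    try:
        return MarketCategory[code.strip().upper()].value
    except KeyError:
        # Assumption: unknown codes are mapped to NSPD instead of raising an error.
        return MarketCategory.NSPD.value
```

Looking codes up by member name keeps the mapping in one place and makes a typo in a code fail loudly wherever the enum is used directly.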

Table: inno_dim_news_category

Schema:

| Field | Type | Description |
| --- | --- | --- |
| ticker | str | The security symbol |
| date | str | The trade date |
| open | float | The open price |
| high | float | The high price |
| low | float | The low price |
| close | float | The close price |
| volume | bigint | The total trade volume (in security) |

The categories:

POLITICS: 35602

WELLNESS: 17945

ENTERTAINMENT: 17362

TRAVEL: 9900

STYLE & BEAUTY: 9814

PARENTING: 8791

HEALTHY LIVING: 6694

QUEER VOICES: 6347

FOOD & DRINK: 6340

BUSINESS: 5992

COMEDY: 5400

SPORTS: 5077

BLACK VOICES: 4583

HOME & LIVING: 4320

PARENTS: 3955
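The counts above are uneven, which matters for the topic classification feature: a model trained on them sees far more POLITICS examples than PARENTS examples. A short sketch, using only the figures listed above, shows how the share of each category and the imbalance ratio could be computed. The function names are illustrative.

```python
# Category counts copied from the list above.
CATEGORY_COUNTS = {
    "POLITICS": 35602,
    "WELLNESS": 17945,
    "ENTERTAINMENT": 17362,
    "TRAVEL": 9900,
    "STYLE & BEAUTY": 9814,
    "PARENTING": 8791,
    "HEALTHY LIVING": 6694,
    "QUEER VOICES": 6347,
    "FOOD & DRINK": 6340,
    "BUSINESS": 5992,
    "COMEDY": 5400,
    "SPORTS": 5077,
    "BLACK VOICES": 4583,
    "HOME & LIVING": 4320,
    "PARENTS": 3955,
}


def category_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's fraction of the total number of articles."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def imbalance_ratio(counts: dict[str, int]) -> float:
    """Return the ratio between the largest and the smallest category."""
    return max(counts.values()) / min(counts.values())


if __name__ == "__main__":
    for name, share in category_shares(CATEGORY_COUNTS).items():
        print(f"{name:<16} {share:6.2%}")
    print(f"imbalance ratio: {imbalance_ratio(CATEGORY_COUNTS):.2f}")
```

For the listed categories, POLITICS accounts for roughly a quarter of the articles and is about nine times larger than PARENTS, so class weighting or resampling may be worth considering.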

#!/usr/bin/env python3

# Global
from datetime import date as _date
from typing import Optional

# External
from pydantic import ConfigDict, BaseModel, Field

# Application
from app.common import SortQueryContext
from app.schema.response import (
    BaseResultOutputResponseModel,
)


class NewsRight(BaseModel):
    ticker: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    event_id: Optional[int] = Field(default=None, json_schema_extra={"nullable": True})
    event_type: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    date_ex: Optional[_date] = Field(default=None, json_schema_extra={"nullable": True})
    date_declaration: Optional[_date] = Field(default=None, json_schema_extra={"nullable": True})
    date_record: Optional[_date] = Field(default=None, json_schema_extra={"nullable": True})
    date_payable: Optional[_date] = Field(default=None, json_schema_extra={"nullable": True})
    title: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    description: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    exchange: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    model_config = ConfigDict(from_attributes=True)

    @classmethod
    def get_order_context(cls) -> SortQueryContext:
        return SortQueryContext(
            default=["ticker", "date_ex"],
            orderable_field=[
                {"name": "ticker"},
                {"name": "date_ex"},
                {"name": "date_declaration"},
                {"name": "date_record"},
                {"name": "date_payable"},
            ]
        )


class NewsRightResponse(BaseResultOutputResponseModel):
    # A fresh empty list is created for every response instance.
    data: list[NewsRight] = Field(default_factory=list)


class NewsTicker(BaseModel):
    date: Optional[_date] = Field(default=None, json_schema_extra={"nullable": True})
    ticker: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    article: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    description: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    source: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    publish_metadata: Optional[str] = Field(default=None, json_schema_extra={"nullable": True})
    model_config = ConfigDict(from_attributes=True)

    @classmethod
    def get_order_context(cls) -> SortQueryContext:
        return SortQueryContext(
            default=["ticker", "date"],
            orderable_field=[
                {"name": "ticker"},
                {"name": "date"},
            ]
        )


class NewsTickerResponse(BaseResultOutputResponseModel):
    # A fresh empty list is created for every response instance.
    data: list[NewsTicker] = Field(default_factory=list)
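The `get_order_context` methods above declare a default ordering and a list of orderable fields. How `SortQueryContext` consumes them is not shown in this document, so the sketch below uses a hypothetical stand-in, `SortContextSketch`, to illustrate one plausible use: validating a requested ordering against the allowed fields. The leading `-` for descending order is an assumption and not a documented convention of the application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SortContextSketch:
    """Hypothetical stand-in for app.common.SortQueryContext."""
    default: list[str]
    orderable_field: list[dict]

    def resolve(self, requested: Optional[list[str]] = None) -> list[tuple[str, bool]]:
        """Return (field, descending) pairs, falling back to the default ordering."""
        allowed = {entry["name"] for entry in self.orderable_field}
        resolved = []
        for raw in requested or self.default:
            # Assumption: a leading "-" requests descending order.
            descending = raw.startswith("-")
            name = raw[1:] if descending else raw
            if name not in allowed:
                raise ValueError(f"field {name!r} is not orderable")
            resolved.append((name, descending))
        return resolved


# Mirrors the values returned by NewsRight.get_order_context() above.
news_right_order = SortContextSketch(
    default=["ticker", "date_ex"],
    orderable_field=[
        {"name": "ticker"},
        {"name": "date_ex"},
        {"name": "date_declaration"},
        {"name": "date_record"},
        {"name": "date_payable"},
    ],
)
```

Rejecting fields outside `orderable_field` keeps callers from sorting on columns such as `title` or `description` that the model does not expose for ordering.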

TODO

[1] https://www.vinai.io/release-text-translation-model-of-vinai-translate/

[2] https://www.kaggle.com/datasets/rmisra/news-category-dataset

[3] https://www.johnsnowlabs.com/sentiment-analysis-with-spark-nlp-without-machine-learning/#:~:text=Using%20Spark%20NLP%2C%20it%20is,most%20interesting%20subfields%20of%20NLP.

Source Reference