Data Product Governance Guidelines

This document sets out the scope, standards, and quality guidelines for data products built and delivered by the Data Engineering team.


1. What: Data Product Scope

1.1. In-Scope Data Products

These guidelines apply to all data products, classified as follows:

  • Internal Events (EventHub):
    • see the list of supported events here
  • Internal Models & Files:
    • Astra eligible users
    • Restricted access
    • Check eligibility
  • ML/LLM Models:
    • risk score outputs
  • External APIs & DBs:
    • Sardine
    • Intercom batch and stream
    • Braze batch and stream
    • mParticle batch and stream
    • Cable
  • Email Deliveries:
    • Kard Impression

1.2. Out-of-Scope Systems

The following systems are explicitly out of scope for this specific governance framework:

  • Looker
  • Themis
  • WhatsApp project
  • Ad-hoc Notebooks
  • Events forwarded in Kinesis

2. Where: Code & Artifact Organization

This section defines where our code, data, and schemas must be stored and how they should be named.

2.1. GitHub Repositories

All data product code must be stored in its designated repository. The goal of each repository is to do the following (a minimal sketch of this flow appears after the list):

  • Fetch the data unchanged from the data warehouse (a plain SELECT * FROM the table), excluding metadata columns, e.g., report_date.
  • Save the data to a file or send it to the external API.
  • Test that the output is correct. See the Tests Standards section for details.
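
A minimal sketch of this flow, assuming BigQuery as the data warehouse and a CSV delivery; the model and file names are illustrative, and a real implementation should reuse the shared helpers from data-lib rather than this hand-rolled code:

```python
# Hypothetical sketch of the fetch -> save -> test flow (names are illustrative).
import csv

from google.cloud import bigquery

METADATA_COLUMNS = {"report_date"}  # metadata columns are excluded from the delivery

def fetch_rows(table: str) -> list[dict]:
    """Fetch the data unchanged from the warehouse (SELECT *), minus metadata columns."""
    client = bigquery.Client()
    result = client.query(f"SELECT * FROM `{table}`").result()
    return [
        {k: v for k, v in dict(row).items() if k not in METADATA_COLUMNS}
        for row in result
    ]

def save_to_file(rows: list[dict], path: str) -> None:
    """Save the rows to a CSV file (or send them to the external API instead)."""
    # Assumes at least one row; the post-generation tests check for empty output anyway.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    rows = fetch_rows("reporting.rpt_majority_cards_accounts")  # illustrative model name
    save_to_file(rows, "rpt_majority_cards_accounts.csv")
```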

Repositories must:

  • Include a README.md file with the following sections:
    • TODO
  • Use the code from data-lib and move repetitive code into the library.

Repository structure:

  • dt-reporting:
    • Purpose: Manages all periodic, file-based data products.
    • Examples: SFTP deliveries, majority_rdfs, kard reports.
  • dt-bq2pubsub:
    • Purpose: Manages all data streamed via Pub/Sub.
    • Example: EventHub
  • API Repositories:
    • Purpose: Each external-facing API must be housed in its own dedicated repository.
    • Examples: sardine, intercom, braze, mparticle, talkdesk
  • Azure Blob Storage repositories (internal files):
    • Purpose: Trigger changes in the backend data warehouse.
    • Examples: eligible_users, restricted_access, check_eligibility
    • Status: to be deprecated
  • ML model repositories:

2.2. dbt

In dbt, models are organized into the following categories:

  • Reporting Models:
    • require adding schema = 'reporting' to the model config
    • stored in the models/reporting folder
    • Report models should be named to clearly match the product and recipient.
    • Pattern: rpt_<recipient>_<description> (e.g., rpt_majority_cards_accounts)
    • One model per file produced
      • Intermediate models can be used to help with the transformation, but they should not be used in the marts.
      • Intermediate models follow the same naming convention with the int_ prefix: int_<recipient>_<description>
  • Event & Delta Models:
    • require adding schema = 'reverse_etl' to the model config
    • stored in the models/reverse_etl folder
    • Event-based models should follow these patterns (a sketch of the previous-run refresh follows this list):
      • <recipient>_<event_name> (use internal as the recipient for EventHub events, e.g., internal_crde_user_model_updated_event)
      • <recipient>_<event_name>_delta
      • <recipient>_<event_name>_previous_run (created in the Python code)
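
A hedged sketch of the previous-run refresh created in the Python code, assuming BigQuery and that the delta model compares the event model against its previous-run snapshot; the table names are illustrative, not the actual implementation:

```python
# Hypothetical sketch: after a successful delivery, overwrite the _previous_run
# snapshot with the current event model so the next run's _delta model only
# contains new or changed rows.
from google.cloud import bigquery

# Illustrative table names following the <recipient>_<event_name> pattern.
EVENT_TABLE = "reverse_etl.internal_crde_user_model_updated_event"
PREVIOUS_RUN_TABLE = "reverse_etl.internal_crde_user_model_updated_event_previous_run"

def refresh_previous_run() -> None:
    """Replace the previous-run snapshot with the rows that were just delivered."""
    client = bigquery.Client()
    client.query(
        f"CREATE OR REPLACE TABLE `{PREVIOUS_RUN_TABLE}` AS "
        f"SELECT * FROM `{EVENT_TABLE}`"
    ).result()
```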

2.3. Airflow DAGs

Make sure to follow the DAG guidelines in the Airflow documentation. Additionally, each DAG must:

  • Run the dbt models
  • Run the tests and fail the DAG if the tests fail
  • Reuse the reporting class
  • Have 0 retries
  • Have a timeout (to be considered)
  • Use Airflow Variables sparingly; never hardcode secrets

Naming convention:

  • DAG name: <recipient>_<description>_<delivery>
    • description: reports, events, transactions, etc.
    • delivery: API, SFTP, PubSub, or Blob
  • Task name prefixes:
    • dbt freshness: dbt_freshness_ (advisable to add)
    • dbt run: dbt_run_
    • dbt tests: dbt_test_
    • Python code: run_

Reuse the base class for the DAGs.
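
A hedged sketch of a DAG that follows these conventions, assuming Airflow 2.x and plain operators; the DAG id, schedule, dbt selections, and delivery callable are illustrative, and real DAGs should reuse the shared base/reporting classes instead of wiring operators by hand:

```python
# Hypothetical sketch of a conforming DAG (illustrative names throughout).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def deliver_report() -> None:
    # Placeholder for the delivery step (file generation / SFTP upload).
    ...

with DAG(
    dag_id="majority_reports_sftp",          # <recipient>_<description>_<delivery>
    schedule="0 6 * * *",                    # illustrative schedule
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 0},             # 0 retries
    dagrun_timeout=timedelta(hours=2),       # timeout (to be considered)
) as dag:
    dbt_freshness_majority = BashOperator(
        task_id="dbt_freshness_majority",
        bash_command="dbt source freshness --select source:majority",
    )
    dbt_run_majority_reports = BashOperator(
        task_id="dbt_run_majority_reports",
        bash_command="dbt run --select rpt_majority_cards_accounts",
    )
    dbt_test_majority_reports = BashOperator(
        task_id="dbt_test_majority_reports",
        # A non-zero exit code from dbt test fails the task and therefore the DAG.
        bash_command="dbt test --select rpt_majority_cards_accounts",
    )
    run_majority_reports = PythonOperator(
        task_id="run_majority_reports",
        python_callable=deliver_report,
    )

    dbt_freshness_majority >> dbt_run_majority_reports >> dbt_test_majority_reports >> run_majority_reports
```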

2.4. Infrastructure

[TODO] to be added

3. How: Quality & Orchestration

This section defines the mandatory quality gates and processes for delivering data.

3.1. Formatting

dbt

  • make sure the decimal numbers are of type numeric
  • always round decimal numbers

3.2. Tests Standards

Python

In addition to following the general best practices for tests in the Python documentation, the tests should, after file generation:

  • check that the file is not empty
  • check that the data in the columns uses the right data types, using Pydantic (see the sketch below)
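
A minimal sketch of these checks, assuming a CSV output and an illustrative Pydantic row schema; the file name and columns are hypothetical and would come from the actual report contract:

```python
# Hypothetical post-generation checks: the file must not be empty and the
# column values must parse into the expected data types.
import csv
from decimal import Decimal
from pathlib import Path

from pydantic import BaseModel

# Illustrative output file and schema.
REPORT_PATH = Path("rpt_majority_cards_accounts.csv")

class AccountRow(BaseModel):
    account_id: str
    balance: Decimal
    currency: str

def test_file_is_not_empty():
    assert REPORT_PATH.stat().st_size > 0, "generated file is empty"

def test_columns_use_the_right_data_types():
    with REPORT_PATH.open(newline="") as f:
        for raw_row in csv.DictReader(f):
            # Pydantic raises a ValidationError (failing the test) on a type mismatch.
            AccountRow(**raw_row)
```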

dbt

  • Use the contract feature
    • It enforces specifying data types for the columns
    • You can add constraints for the columns, e.g., not null
  • There should be at least one test for each column in the model (except for metadata columns)
  • If a column can contain null values, note this in the documentation
  • For columns with a fixed set of values, use the accepted_values test
  • Specify a unique key and add tests for it
  • If possible, compare the data with other models
    • add the test in the tests/reporting folder
    • follow the naming convention test_<recipient>_<description>
  • Check the number of rows: define a relevant row count (or range) that the model should return, and fail the test if the actual count does not match. [TODO] check how to implement it

3.3. Required Documentation

dt-documentation (data product section)

Documentation template:

Use the doc-template.md to create the documentation for the data product.

dbt

Documentation is required for each model and for every column in the model.

Repositories

Each repository should include technical documentation in its README.md file.

A template can be found in the dt-template-python-repo repository.

[TODO] Add a template for the README.md file; it should cover:

  • How it works
  • Services used
  • How to fix
  • What variables are needed
  • How to test (authentication required)
  • Avoid cluttered documentation
  • Link to data-documentation (dt-documentation)