Data Product Governance Guidelines

This document sets out the scope, standards, and quality guidelines for data products built and delivered by the Data Engineering team.


1. What: Data Product Scope

1.1. In-Scope Data Products

These guidelines apply to all data products, classified as follows:

  • Internal Events (EventHub):
    • see the list of supported events here
  • Internal Models & Files:
    • Astra eligible users
    • Restricted access
    • Check eligibility
  • ML/LLM Models:
    • risk score outputs
  • External APIs & DBs:
    • Sardine
    • Intercom batch and stream
    • Braze batch and stream
    • mParticle batch and stream
    • Cable
  • Email Deliveries:
    • Kard Impression

1.2. Out-of-Scope Systems

The following systems are explicitly out of scope for this specific governance framework:

  • Looker
  • Themis
  • WhatsApp project
  • Ad-hoc Notebooks
  • Events forwarded in Kinesis

2. Where: Code & Artifact Organization

This section defines where our code, data, and schemas must be stored and how they should be named.

2.1. GitHub Repositories

All data product code must be stored in its designated repository. The goal of each repository is to do the following (a minimal sketch of this flow appears after the list):

  • Fetch the data unchanged from the data warehouse (a plain SELECT * FROM the table), excluding metadata columns, e.g., report_date.
  • Save the data to a file or send it to the external API.
  • Test that the output is correct. See the Tests Standards section for details.
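
A minimal sketch of this flow, assuming BigQuery as the data warehouse and a CSV delivery; the model and file names are illustrative, and a real implementation should reuse the shared helpers from data-lib rather than this hand-rolled code:

```python
# Hypothetical sketch of the fetch -> save -> test flow (names are illustrative).
import csv

from google.cloud import bigquery

METADATA_COLUMNS = {"report_date"}  # metadata columns are excluded from the delivery

def fetch_rows(table: str) -> list[dict]:
    """Fetch the data unchanged from the warehouse (SELECT *), minus metadata columns."""
    client = bigquery.Client()
    result = client.query(f"SELECT * FROM `{table}`").result()
    return [
        {k: v for k, v in dict(row).items() if k not in METADATA_COLUMNS}
        for row in result
    ]

def save_to_file(rows: list[dict], path: str) -> None:
    """Save the rows to a CSV file (or send them to the external API instead)."""
    # Assumes at least one row; the post-generation tests check for empty output anyway.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    rows = fetch_rows("reporting.rpt_majority_cards_accounts")  # illustrative model name
    save_to_file(rows, "rpt_majority_cards_accounts.csv")
```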

Repositories must:

  • Include a README.md file with the following sections:
    • TODO
  • Use the code from data-lib and move repetitive code into the library.

Repository structure:

  • dt-reporting:
    • Purpose: Manages all periodic, file-based data products.
    • Examples: SFTP deliveries, majority_rdfs, kard reports.
  • dt-bq2pubsub:
    • Purpose: Manages all data streamed via Pub/Sub.
    • Example: EventHub
  • API Repositories:
    • Purpose: Each external-facing API must be housed in its own dedicated repository.
    • Examples: sardine, intercom, braze, mparticle, talkdesk
  • Azure Blob Storage repositories (internal files):
    • Purpose: Trigger changes in the backend data warehouse.
    • Examples: eligible_users, restricted_access, check_eligibility
    • Status: to be deprecated
  • ML model repositories:

2.2. dbt

In dbt, models are organized into the following categories:

  • Reporting Models:
    • require adding schema = 'reporting' to the model config
    • stored in the models/reporting folder
    • Report models should be named to clearly match the product and recipient.
    • Pattern: rpt_<recipient>_<description> (e.g., rpt_majority_cards_accounts)
    • One model per file produced
      • Intermediate models can be used to help with the transformation, but they should not be used in the marts.
      • Intermediate models follow the same naming convention with the int_ prefix: int_<recipient>_<description>
  • Event & Delta Models:
    • require adding schema = 'reverse_etl' to the model config
    • stored in the models/reverse_etl folder
    • Event-based models should follow these patterns (a sketch of the previous-run refresh follows this list):
      • <recipient>_<event_name> (use internal as the recipient for EventHub events, e.g., internal_crde_user_model_updated_event)
      • <recipient>_<event_name>_delta
      • <recipient>_<event_name>_previous_run (created in the Python code)
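
A hedged sketch of the previous-run refresh created in the Python code, assuming BigQuery and that the delta model compares the event model against its previous-run snapshot; the table names are illustrative, not the actual implementation:

```python
# Hypothetical sketch: after a successful delivery, overwrite the _previous_run
# snapshot with the current event model so the next run's _delta model only
# contains new or changed rows.
from google.cloud import bigquery

# Illustrative table names following the <recipient>_<event_name> pattern.
EVENT_TABLE = "reverse_etl.internal_crde_user_model_updated_event"
PREVIOUS_RUN_TABLE = "reverse_etl.internal_crde_user_model_updated_event_previous_run"

def refresh_previous_run() -> None:
    """Replace the previous-run snapshot with the rows that were just delivered."""
    client = bigquery.Client()
    client.query(
        f"CREATE OR REPLACE TABLE `{PREVIOUS_RUN_TABLE}` AS "
        f"SELECT * FROM `{EVENT_TABLE}`"
    ).result()
```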

2.3. Airflow DAGs

Make sure to follow the DAG guidelines in the Airflow documentation. Additionally, each DAG must:

  • Run the dbt models
  • Run the tests and fail the DAG if the tests fail
  • Reuse the reporting class
  • Have 0 retries
  • Have a timeout (to be considered)
  • Use Airflow Variables sparingly; never hardcode secrets

Naming convention:

  • DAG name: <recipient>_<description>_<delivery>
    • description: reports, events, transactions, etc.
    • delivery: API, SFTP, PubSub, or Blob
  • Task name prefixes:
    • dbt freshness: dbt_freshness_ (advisable to add)
    • dbt run: dbt_run_
    • dbt tests: dbt_test_
    • Python code: run_

Reuse the base class for the DAGs.
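
A hedged sketch of a DAG that follows these conventions, assuming Airflow 2.x and plain operators; the DAG id, schedule, dbt selections, and delivery callable are illustrative, and real DAGs should reuse the shared base/reporting classes instead of wiring operators by hand:

```python
# Hypothetical sketch of a conforming DAG (illustrative names throughout).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def deliver_report() -> None:
    # Placeholder for the delivery step (file generation / SFTP upload).
    ...

with DAG(
    dag_id="majority_reports_sftp",          # <recipient>_<description>_<delivery>
    schedule="0 6 * * *",                    # illustrative schedule
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 0},             # 0 retries
    dagrun_timeout=timedelta(hours=2),       # timeout (to be considered)
) as dag:
    dbt_freshness_majority = BashOperator(
        task_id="dbt_freshness_majority",
        bash_command="dbt source freshness --select source:majority",
    )
    dbt_run_majority_reports = BashOperator(
        task_id="dbt_run_majority_reports",
        bash_command="dbt run --select rpt_majority_cards_accounts",
    )
    dbt_test_majority_reports = BashOperator(
        task_id="dbt_test_majority_reports",
        # A non-zero exit code from dbt test fails the task and therefore the DAG.
        bash_command="dbt test --select rpt_majority_cards_accounts",
    )
    run_majority_reports = PythonOperator(
        task_id="run_majority_reports",
        python_callable=deliver_report,
    )

    dbt_freshness_majority >> dbt_run_majority_reports >> dbt_test_majority_reports >> run_majority_reports
```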

2.4. Infrastructure

[TODO] to be added

3. How: Quality & Orchestration

This section defines the mandatory quality gates and processes for delivering data.

3.1. Formatting

dbt

  • make sure the decimal numbers are of type numeric
  • always round decimal numbers

3.2. Tests Standards

Python

In addition to following the general best practices for tests in the Python documentation, the tests should, after file generation:

  • check that the file is not empty
  • check that the data in the columns uses the right data types, using Pydantic (see the sketch below)
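
A minimal sketch of these checks, assuming a CSV output and an illustrative Pydantic row schema; the file name and columns are hypothetical and would come from the actual report contract:

```python
# Hypothetical post-generation checks: the file must not be empty and the
# column values must parse into the expected data types.
import csv
from decimal import Decimal
from pathlib import Path

from pydantic import BaseModel

# Illustrative output file and schema.
REPORT_PATH = Path("rpt_majority_cards_accounts.csv")

class AccountRow(BaseModel):
    account_id: str
    balance: Decimal
    currency: str

def test_file_is_not_empty():
    assert REPORT_PATH.stat().st_size > 0, "generated file is empty"

def test_columns_use_the_right_data_types():
    with REPORT_PATH.open(newline="") as f:
        for raw_row in csv.DictReader(f):
            # Pydantic raises a ValidationError (failing the test) on a type mismatch.
            AccountRow(**raw_row)
```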

dbt

  • Use the contract feature
    • It enforces specifying data types for the columns
    • You can add constraints for the columns, e.g., not null
  • There should be at least one test for each column in the model (except for metadata columns)
  • If a column can contain null values, note this in the documentation
  • For columns with a fixed set of values, use the accepted_values test
  • Specify a unique key and add tests for it
  • If possible, compare the data with other models
    • add the test in the tests/reporting folder
    • follow the naming convention test_<recipient>_<description>
  • Check the number of rows: define a relevant row count (or range) that the model should return, and fail the test if the actual count does not match. [TODO] check how to implement it

3.3. Required Documentation

dt-documentation (data product section)

Documentation template:

Use the doc-template.md to create the documentation for the data product.

dbt

Documentation is required for each model and for every column in the model.

Repositories

Each repository should include technical documentation in its README.md file.

A template can be found in the dt-template-python-repo repository.

[TODO] Add a template for the README.md file; it should cover:

  • How it works
  • Services used
  • How to fix
  • What variables are needed
  • How to test (authentication required)
  • Avoid cluttered documentation
  • Link to data-documentation (dt-documentation)