dbt

dbt is our transformation tool for modeling data in BigQuery. It allows us to write SQL transformations, test data quality, and document our data warehouse.

Best Practices

SQL Styling

We follow the SQL Styling conventions on dbt

The styling is enforced by the sqlfluff linter.

Exceptions

Leading commas - Use leading commas for field lists
Line length - 140 characters maximum (excluding comments)

Documentation

Each model must have documentation. Specify partition and cluster column (to guide AI when writing SQL)
- if the model is Deprecated, mention it in the documentation with the date of deprecation, and the reason.
Models with 'lookml' tag are used to create a Looker view
- These models must have documentation for every column
Models with 'airflow' tag are run by a DAG outside of Frequent or Daily runs
- These models must have documentation for every column

Naming Conventions

Columns

Country Columns

3-Letter Country Code - suffix: _id
- Example: country_id with value USA

2-Letter Country Code - suffix: _code
- Example: country_code with value US

Full Country Name - suffix: _name
- Example: country_name with value United States

Date and Timestamp Columns

Date Columns - suffix: _date
- Example: order_date, signup_date

UTC Timestamps - suffix: _at
- Example: created_at, updated_at
- Exception: Source/S3 tables may use different conventions. These should be fixed in the _source model.

Timezone-Specific Timestamps - suffix: _at_<timezone>
- Example: passed_kyc_at_mt (Mountain Time)

Models

Seeds & Source Models

If the data is coming from a source, use the _source model to ingest the data. Include partition.
Do not use seeds directly in the models. Use the _source model to ingest the data. It allows for testing and documentation.
Add seeds documentation in the properties.yml file.

Fact and Dimension Models

Fact Models (Fact Tables)

Facts contain measurable, quantitative data:

Contain metrics and measurements
Usually numeric and additive (can be aggregated)
Represent business events or transactions (orders, sales, etc.)
Often large tables that grow continuously
Answer "how many/much" questions

Examples: fact_remittance, fact_transaction, fact_user_session

Dimension Models (Dimension Tables)

Dimensions contain descriptive attributes that provide context:

Contain text-based, categorical data
Represent business entities (customers, products, stores, etc.)
Change less frequently than fact tables
Have primary keys that fact tables reference
Answer "who/what/when/where/why" questions

Incremental Strategy

Add incremental strategy when possible for better performance. See more below.

Additional Requirements

Unique Keys

Every model must have a unique key
Use meaningful keys from the data source when possible (e.g., user_id if the level of detail is per user)

Materializations

Merge

When to Use:

Small tables (less than 1GB)
Data that can change within longer and undefined time frames (e.g., SCD2 or remittance)

Requirements:

This is dbt's default materialization
Unique key must be present
If the table is partitioned, use the script variable _dbt_max_partition:

{%- if is_incremental() %}
    and date(ingested_at) >= date(_dbt_max_partition)
{%- endif %}

Insert Overwrite

When to Use:

Preferred method for larger tables (considered mandatory over 10GB)
Reading from source
Requires partitioning

Configuration:

To optimize cost, use current_date_offset:

meta = {current_date_offset: 1},
partition_by = {
    'field': 'dw_ingestion_at', --the same column needs to be used in the incremental condition
    'data_type': 'timestamp',
    'granularity': 'day'
}

Note: The partition granularity doesn't need to be always day. Granularities month, or year are supported, too. Adjust the condition so the partition overwrites correctly. The minimum data per partition/cluster should be 100MB. Size Partition and cluster results

Incremental Condition:

{% if is_incremental() %}
where date(dw_ingestion_at) >= date(current_date() - interval {{get_current_date_offset()}} day)
{% endif %}

Rerunning Models

Some models (mostly stage models with partitions) have the variable current_date_offset. This variable changes the size of the partition being rerun and helps avoid rerunning the whole table.

Usage:

Add the variable at the end of your dbt command:

dbt run -m model_name --vars '{current_date_offset: 3}'

This will rerun the last 3 days of data instead of the entire table.