
dbt

dbt is our transformation tool for modeling data in BigQuery. It allows us to write SQL transformations, test data quality, and document our data warehouse.

Best Practices

SQL Styling

We follow the SQL Styling conventions from dbt.

The styling is enforced by the sqlfluff linter.

Exceptions

  • Leading commas - Use leading commas for field lists
  • Line length - 140 characters maximum (excluding comments)
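
These exceptions can be encoded in the project's sqlfluff configuration. A minimal sketch (exact section and rule names depend on the sqlfluff version in use):

```ini
[sqlfluff]
max_line_length = 140

[sqlfluff:layout:type:comma]
line_position = leading
```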

Documentation

  • Each model must have documentation. Specify the partition and cluster columns (to guide AI when writing SQL).
    • If the model is deprecated, note this in the documentation along with the date of and reason for deprecation.
  • Models with the 'lookml' tag are used to create a Looker view.
    • These models must have documentation for every column.
  • Models with the 'airflow' tag are run by a DAG outside of the Frequent or Daily runs.
    • These models must have documentation for every column.
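
As a sketch, a model entry in properties.yml following these rules might look like this (the model and column names are hypothetical):

```yaml
models:
  - name: fact_remittance
    description: >
      One row per remittance.
      Partitioned by dw_ingestion_at (day), clustered by country_id.
    config:
      tags: ['lookml']
    columns:
      - name: remittance_id
        description: Unique identifier of the remittance.
      - name: country_id
        description: 3-letter country code of the destination country.
```

Because the model carries the 'lookml' tag, every column gets a description.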

Naming Conventions

Columns

Country Columns

  • 3-Letter Country Code - suffix: _id
    • Example: country_id with value USA
  • 2-Letter Country Code - suffix: _code
    • Example: country_code with value US
  • Full Country Name - suffix: _name
    • Example: country_name with value United States

Date and Timestamp Columns

  • Date Columns - suffix: _date

    • Example: order_date, signup_date
  • UTC Timestamps - suffix: _at
    • Example: created_at, updated_at
    • Exception: Source/S3 tables may use different conventions. These should be fixed in the _source model.
  • Timezone-Specific Timestamps - suffix: _at_<timezone>
    • Example: passed_kyc_at_mt (Mountain Time)

Models

Seeds & Source Models
  • If the data comes from a source, ingest it through a _source model. Include a partition.
  • Do not use seeds directly in models; ingest them through a _source model as well. This allows for testing and documentation.
  • Add seed documentation in the properties.yml file.
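
A minimal _source model sketch following these rules (the source, model, and column names are hypothetical):

```sql
-- stg_orders_source.sql (hypothetical)
{{ config(
    materialized='incremental',
    partition_by={
        'field': 'ingested_at',
        'data_type': 'timestamp',
        'granularity': 'day'
    }
) }}

select
    order_id
    , order_date
    , ingested_at
from {{ source('s3', 'orders') }}
```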
Fact and Dimension Models

Fact Models (Fact Tables)

Facts contain measurable, quantitative data:

  • Contain metrics and measurements
  • Usually numeric and additive (can be aggregated)
  • Represent business events or transactions (orders, sales, etc.)
  • Often large tables that grow continuously
  • Answer "how many/much" questions

Examples: fact_remittance, fact_transaction, fact_user_session

Dimension Models (Dimension Tables)

Dimensions contain descriptive attributes that provide context:

  • Contain text-based, categorical data
  • Represent business entities (customers, products, stores, etc.)
  • Change less frequently than fact tables
  • Have primary keys that fact tables reference
  • Answer "who/what/when/where/why" questions

Incremental Strategy

  • Add an incremental strategy when possible for better performance. See the Materializations section below.

Additional Requirements

Unique Keys

  • Every model must have a unique key
  • Use meaningful keys from the data source when possible (e.g., user_id if the level of detail is per user)

Materializations

Merge

When to Use:

  • Small tables (less than 1GB)
  • Data that can change within longer and undefined time frames (e.g., SCD2 or remittance)

Requirements:

  • Merge is dbt's default incremental strategy on BigQuery
  • Unique key must be present
  • If the table is partitioned, use the script variable _dbt_max_partition:
{%- if is_incremental() %}
    and date(ingested_at) >= date(_dbt_max_partition)
{%- endif %}
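
Putting these requirements together, a merge-materialized model might be sketched as follows (the model, source, and column names are hypothetical):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='remittance_id',
    partition_by={
        'field': 'ingested_at',
        'data_type': 'timestamp',
        'granularity': 'day'
    }
) }}

select
    remittance_id
    , status
    , ingested_at
from {{ source('payments', 'remittances') }}
{%- if is_incremental() %}
where date(ingested_at) >= date(_dbt_max_partition)
{%- endif %}
```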

Insert Overwrite

When to Use:

  • Preferred method for larger tables (considered mandatory over 10GB)
  • Reading from source
  • Requires partitioning

Configuration:

To optimize cost, set current_date_offset in the model config:

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    meta={'current_date_offset': 1},
    partition_by={
        'field': 'dw_ingestion_at',
        'data_type': 'timestamp',
        'granularity': 'day'
    }
) }}

The partition field must be the same column used in the incremental condition.

Note: The partition granularity doesn't always need to be day; month and year granularities are supported, too. Adjust the incremental condition so the partitions are overwritten correctly. Each partition/cluster should hold at least 100MB of data.

Incremental Condition:

{% if is_incremental() %}
where date(dw_ingestion_at) >= date(current_date() - interval {{get_current_date_offset()}} day)
{% endif %}
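
The get_current_date_offset() macro is defined elsewhere in the project; as an assumption, a minimal implementation might simply read the variable with a default of 1 day:

```sql
{% macro get_current_date_offset() %}
    {{ return(var('current_date_offset', 1)) }}
{% endmacro %}
```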

Rerunning Models

Some models (mostly stage models with partitions) use the current_date_offset variable. It controls how many days of partitions are rerun, which avoids rerunning the whole table.

Usage:

Add the variable at the end of your dbt command:

dbt run -m model_name --vars '{current_date_offset: 3}'

This will rerun the last 3 days of data instead of the entire table.