dbt
dbt is our transformation tool for modeling data in BigQuery. It allows us to write SQL transformations, test data quality, and document our data warehouse.
Best Practices
SQL Styling
We follow the SQL Styling conventions on dbt
The styling is enforced by the sqlfluff linter.
Exceptions
- Leading commas - Use leading commas for field lists
- Line length - 140 characters maximum (excluding comments)
Documentation
- Each model must have documentation. Specify partition and cluster column (to guide AI when writing SQL)
- if the model is Deprecated, mention it in the documentation with the date of deprecation, and the reason.
- Models with
'lookml'tag are used to create a Looker view- These models must have documentation for every column
- Models with
'airflow'tag are run by a DAG outside of Frequent or Daily runs- These models must have documentation for every column
Naming Conventions
Columns
Country Columns
- 3-Letter Country Code - suffix:
_id- Example:
country_idwith valueUSA
- Example:
- 2-Letter Country Code - suffix:
_code- Example:
country_codewith valueUS
- Example:
- Full Country Name - suffix:
_name- Example:
country_namewith valueUnited States
- Example:
Date and Timestamp Columns
-
Date Columns - suffix:
_date- Example:
order_date,signup_date
- Example:
- UTC Timestamps - suffix:
_at- Example:
created_at,updated_at - Exception: Source/S3 tables may use different conventions. These should be fixed in the
_sourcemodel.
- Example:
- Timezone-Specific Timestamps - suffix:
_at_<timezone>- Example:
passed_kyc_at_mt(Mountain Time)
- Example:
Models
Seeds & Source Models
- If the data is coming from a source, use the
_sourcemodel to ingest the data. Include partition. - Do not use seeds directly in the models. Use the
_sourcemodel to ingest the data. It allows for testing and documentation. - Add seeds documentation in the
properties.ymlfile.
Fact and Dimension Models
Fact Models (Fact Tables)
Facts contain measurable, quantitative data:
- Contain metrics and measurements
- Usually numeric and additive (can be aggregated)
- Represent business events or transactions (orders, sales, etc.)
- Often large tables that grow continuously
- Answer "how many/much" questions
Examples: fact_remittance, fact_transaction, fact_user_session
Dimension Models (Dimension Tables)
Dimensions contain descriptive attributes that provide context:
- Contain text-based, categorical data
- Represent business entities (customers, products, stores, etc.)
- Change less frequently than fact tables
- Have primary keys that fact tables reference
- Answer "who/what/when/where/why" questions
Incremental Strategy
- Add incremental strategy when possible for better performance. See more below.
Additional Requirements
Unique Keys
- Every model must have a unique key
- Use meaningful keys from the data source when possible (e.g.,
user_idif the level of detail is per user)
Materializations
Merge
When to Use:
- Small tables (less than 1GB)
- Data that can change within longer and undefined time frames (e.g., SCD2 or remittance)
Requirements:
- This is dbt's default materialization
- Unique key must be present
- If the table is partitioned, use the script variable
_dbt_max_partition:
Insert Overwrite
When to Use:
- Preferred method for larger tables (considered mandatory over 10GB)
- Reading from source
- Requires partitioning
Configuration:
To optimize cost, use current_date_offset:
meta = {current_date_offset: 1},
partition_by = {
'field': 'dw_ingestion_at', --the same column needs to be used in the incremental condition
'data_type': 'timestamp',
'granularity': 'day'
}
Note: The partition granularity doesn't need to be always day. Granularities month, or year are supported, too. Adjust the condition so the partition overwrites correctly. The minimum data per partition/cluster should be 100MB. Size Partition and cluster results
Incremental Condition:
{% if is_incremental() %}
where date(dw_ingestion_at) >= date(current_date() - interval {{get_current_date_offset()}} day)
{% endif %}
Rerunning Models
Some models (mostly stage models with partitions) have the variable current_date_offset. This variable changes the size of the partition being rerun and helps avoid rerunning the whole table.
Usage:
Add the variable at the end of your dbt command:
This will rerun the last 3 days of data instead of the entire table.