BigQuery
BigQuery is our data warehouse hosted on Google Cloud Platform (Prod | Dev).
Finding Raw Events
All backend events are sent through Kinesis to one table in our BigQuery data warehouse:
- Prod: dw-prod-gwiiag.kinesis.backend_events
- Stage: dw-dev-s5dlij.kinesis.backend_events_stage
- Dev: dw-dev-s5dlij.kinesis.backend_events
Best Practices
- Use CTEs to filter data before joining with other tables. This improves performance and readability (see the example after this list).
- List the columns you need (avoid using *) to improve readability and performance.
- Always use a date filter. Extra points if you use the partition column to filter the data ⭐️. You can find it in the table details.
- If the table is clustered, also filter on the cluster column, and it must be the first filter. You can find it in the table details as well.
- All names should be in snake_case.
- If you need any help please feel free to ask us! 💬
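For example, a query that follows these practices might look like the sketch below. The backend_events table, its partition column event_ingested_at, and its cluster column event_name come from this page; the event name, the selected columns, and the joined users table are illustrative assumptions.

with recent_events as (
    -- Filter on the partition and cluster columns before joining
    select event_id, user_id, event_ingested_at
    from dw-prod-gwiiag.kinesis.backend_events
    where event_ingested_at >= 'yyyy-mm-dd'
      and event_name = 'UserSignedUp'   -- hypothetical event name
)
select
    e.user_id,
    e.event_ingested_at,
    u.country
from recent_events e
join dbt_stage.users u                  -- hypothetical table, for illustration only
    on u.user_id = e.user_id
limit 100;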
Querying backend_events table
select *
from dw-prod-gwiiag.kinesis.backend_events
where event_ingested_at >= 'yyyy-mm-dd'
and event_name = 'EventName'
limit 100;
Warning
- Always filter by event_ingested_at first. The table is partitioned by ingestion day for performance.
- Use event_name as the secondary filter. The table is clustered by this field for faster queries.
- Add additional conditions after these two filters as needed.
- If you need to check the data inside an event, try querying the dbt_stage.event_name_as_snake_case table (see the example after this list).
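For example, to inspect the payload of a specific event you could query its dbt_stage table. The event name and columns below are illustrative assumptions; check the table details for the real schema.

-- Hypothetical event table and columns, shown only to illustrate the pattern
select event_id, user_id, event_created_at
from dbt_stage.payment_completed
where event_created_at >= 'yyyy-mm-dd'
limit 100;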
Performance Monitoring
- Count Dashboard - Performance metrics
- Cost Dashboard - Query costs and usage
Importing ad-hoc data
Sometimes we receive data that we need to import into BigQuery for ad-hoc analysis or backfill purposes.
Before the import, the file should be placed in the majority-bulk-upload bucket in the dw-prod project. It should follow the naming convention <date_of_export>_filename.
It can then be imported into the dw-prod-gwiiag.bulk_upload dataset. You can use the UI to import the data; for the CLI, use the instructions below. The table name should be the same as the filename (in snake_case).
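For example, a minimal sketch for placing an export in the bucket with gsutil (the filename is illustrative):

# Upload the export to the bulk-upload bucket, following the naming convention
gsutil cp ./20241114_user_exports.json gs://majority-bulk-upload/20241114_user_exports.json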
JSON Import
For JSON files, BigQuery only supports the newline-delimited JSON (ndJSON) format. See Loading JSON data from Cloud Storage.
Convert JSON List to ndJSON
If you have a JSON list file, convert it to ndJSON:
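One way to do this, assuming the source file is a top-level JSON array and you have jq installed (filenames are illustrative):

# Write each array element on its own line (ndJSON)
jq -c '.[]' 20241114_user_exports.json > 20241114_user_exports.ndjson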
Import with CLI:
bq load \
--autodetect \
--null_marker='' \
--source_format=NEWLINE_DELIMITED_JSON \
dw-prod-gwiiag.bulk_upload.20241114_table_name \
gs://majority-bulk-upload/20241114_table_name.json
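After the load finishes, you can sanity-check the result from the CLI as well. This is an optional sketch, reusing the table name from the command above:

# Quick row-count check on the freshly loaded table
bq query --nouse_legacy_sql \
  'select count(*) as row_count from `dw-prod-gwiiag.bulk_upload.20241114_table_name`'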
Time Travel
BigQuery's time travel feature allows you to query historical data.
BigQuery retains historical data for 7 days by default. Check your table's time travel window before attempting recovery.
select *
from dataset_name.table_name
for system_time as of timestamp_sub(current_timestamp(), interval 1 hour);
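To restore a table to a previous state, or recover one that was recently deleted, one option is to copy a point-in-time snapshot into a new table using a bq snapshot decorator. The offset is in milliseconds and the table names below are illustrative.

# Copy the table as it existed one hour ago (3,600,000 ms) into a new table
bq cp dataset_name.table_name@-3600000 dataset_name.table_name_restored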
Common Use Cases
- Recover accidentally deleted data
- Compare current vs. historical states
- Audit data changes
- Restore tables to a previous state