BigQuery
BigQuery is our data warehouse hosted on Google Cloud Platform (Prod | Dev).
Finding Raw Events
All backend events are sent through Kinesis to one table in our BigQuery data warehouse:
- Prod: dw-prod-gwiiag.kinesis.backend_events
- Stage: dw-dev-s5dlij.kinesis.backend_events_stage
- Dev: dw-dev-s5dlij.kinesis.backend_events
Best Practices
- Use CTEs to filter data before joining with other tables. This improves performance and readability (see the example after this list).
- List the columns you need (avoid using *) to improve readability and performance.
- Always use a date filter. Extra points if you use the partition column to filter the data ⭐️. You can find it in the table details.
- If the table is clustered, also filter on the cluster column, and it must be the first filter. You can find it in the table details as well.
- All names should be in snake_case.
- If you need any help please feel free to ask us! 💬
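For example, a query that follows these practices might look like the sketch below. The backend_events table, its partition column event_ingested_at, and its cluster column event_name come from this page; the event name, the selected columns, and the joined users table are illustrative assumptions.

with recent_events as (
    -- Filter on the partition and cluster columns before joining
    select event_id, user_id, event_ingested_at
    from dw-prod-gwiiag.kinesis.backend_events
    where event_ingested_at >= 'yyyy-mm-dd'
      and event_name = 'UserSignedUp'   -- hypothetical event name
)
select
    e.user_id,
    e.event_ingested_at,
    u.country
from recent_events e
join dbt_stage.users u                  -- hypothetical table, for illustration only
    on u.user_id = e.user_id
limit 100;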
Querying backend_events table
select *
from dw-prod-gwiiag.kinesis.backend_events
where event_ingested_at >= 'yyyy-mm-dd'
and event_name = 'EventName'
limit 100;
Warning
- Always filter by event_ingested_at first. The table is partitioned by ingestion day for performance.
- Use event_name as the secondary filter. The table is clustered by this field for faster queries.
- Add additional conditions after these two filters as needed.
- If you need to check the data inside an event, try querying the dbt_stage.event_name_as_snake_case table (see the example after this list).
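For example, to inspect the payload of a specific event you could query its dbt_stage table. The event name and columns below are illustrative assumptions; check the table details for the real schema.

-- Hypothetical event table and columns, shown only to illustrate the pattern
select event_id, user_id, event_created_at
from dbt_stage.payment_completed
where event_created_at >= 'yyyy-mm-dd'
limit 100;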
Performance Monitoring
- Count Dashboard - Performance metrics
- Cost Dashboard - Query costs and usage
Importing ad-hoc data
Sometimes we receive data that we need to import into BigQuery for ad-hoc analysis or backfill purposes.
Before the import, the file should be placed in the majority-bulk-upload bucket in the dw-prod project. It should follow the naming convention <date_of_export>_filename.
It can then be imported into the dw-prod-gwiiag.bulk_upload dataset. You can use the UI to import the data; for the CLI, use the instructions below. The table name should be the same as the filename (in snake_case).
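For example, a minimal sketch for placing an export in the bucket with gsutil (the filename is illustrative):

# Upload the export to the bulk-upload bucket, following the naming convention
gsutil cp ./20241114_user_exports.json gs://majority-bulk-upload/20241114_user_exports.json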
JSON Import
For JSON files, BigQuery only supports the newline-delimited JSON (ndJSON) format. See Loading JSON data from Cloud Storage.
Convert JSON List to ndJSON
If you have a JSON list file, convert it to ndJSON:
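One way to do this, assuming the source file is a top-level JSON array and you have jq installed (filenames are illustrative):

# Write each array element on its own line (ndJSON)
jq -c '.[]' 20241114_user_exports.json > 20241114_user_exports.ndjson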
Import with CLI:
bq load \
--autodetect \
--null_marker='' \
--source_format=NEWLINE_DELIMITED_JSON \
dw-prod-gwiiag.bulk_upload.20241114_table_name \
gs://majority-bulk-upload/20241114_table_name.json
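After the load finishes, you can sanity-check the result from the CLI as well. This is an optional sketch, reusing the table name from the command above:

# Quick row-count check on the freshly loaded table
bq query --nouse_legacy_sql \
  'select count(*) as row_count from `dw-prod-gwiiag.bulk_upload.20241114_table_name`'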
Time Travel
BigQuery's time travel feature allows you to query historical data.
BigQuery retains historical data for 7 days by default. Check your table's time travel window before attempting recovery.
select *
from dataset_name.table_name
for system_time as of timestamp_sub(current_timestamp(), interval 1 hour);
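To restore a table to a previous state, or recover one that was recently deleted, one option is to copy a point-in-time snapshot into a new table using a bq snapshot decorator. The offset is in milliseconds and the table names below are illustrative.

# Copy the table as it existed one hour ago (3,600,000 ms) into a new table
bq cp dataset_name.table_name@-3600000 dataset_name.table_name_restored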
Common Use Cases
- Recover accidentally deleted data
- Compare current vs. historical states
- Audit data changes
- Restore tables to a previous state