
File Ingestion

All SFTP-based file ingestion runs through the unified dt-file-ingestion image, orchestrated by Airflow.

Providers

Checkout

Auth: SSH key (checkout-ssh-key Airflow variable, raw PEM)
DAG: checkout_sftp_download
Schedule: 30 */4 * * * (every 4 hours at :30)
GCS bucket: checkout-sftp-{env}
Processor: raw

Reports:

| Report | SFTP path | Pattern |
|---|---|---|
| checkout_financial_actions | /majority-usa-llc/reports-majority-usa-llc/financial-actions/payout-id | *.csv |
| checkout_financial_actions_fees | /majority-usa-llc/reports-majority-usa-llc/financial-actions/date-range | *.csv |
| checkout_payouts | /majority-usa-llc/reports-majority-usa-llc/payouts | *.csv |

InComm

Auth: Password (incomm-sftp-password-key Airflow variable, username: majority_prod)
DAG: incomm_sftp_download
Schedule: 30 9 * * * (daily at 09:30 UTC)
GCS bucket: incomm-sftp-{env}
Processor: raw

Reports:

| Report | SFTP path | Pattern | Notes |
|---|---|---|---|
| incomm_cashtie_billing_report | /reports | 2*/*.csv | Recursive listing (date-named subdirectories) |
| incomm_swipe_report | /reports/spil_reports | *.csv | UTF-8 BOM and trailing blank lines stripped |

The DAG includes a downstream dbt step: dbt build --select tag:incomm --target airflow_federated.
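The swipe-report cleanup noted in the table above (BOM and trailing blank lines) can be sketched roughly as follows. This is an illustrative sketch only; the function name is an assumption, not the actual dt-file-ingestion code.

```python
import codecs

def clean_swipe_report(raw: bytes) -> bytes:
    """Strip a UTF-8 BOM and trailing blank lines from report bytes.

    Illustrative sketch of the cleanup applied to incomm_swipe_report;
    the real dt-file-ingestion implementation may differ.
    """
    # Drop the UTF-8 byte-order mark if present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Remove trailing blank lines, keeping a single final newline.
    lines = raw.splitlines()
    while lines and not lines[-1].strip():
        lines.pop()
    return b"\n".join(lines) + b"\n" if lines else b""
```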


Lithic

Auth: SSH key (lithic-sftp-private-key Airflow variable)
DAG: lithic_sftp_download
Schedule: 0 13 * * * (daily at 13:00 UTC)
GCS bucket: lithic-sftp-{env}
Processor: raw

Reports:

| Report | SFTP path | Pattern |
|---|---|---|
| settlement_detail | /lithic_reports | *_settlement_detail.csv |
| cards | /lithic_reports | *_cards.csv |
| daily_network_settlement_summary | /lithic_reports | *_daily_network_settlement_summary.csv |
| accounts | /lithic_reports | *_accounts.csv |
| card_transactions | /lithic_reports | *_card_transactions.csv |
| network_reports | /network-reports | *.txt |

Network reports run sequentially after all Lithic report downloads complete.


CFSB

Auth: SSH key (cfsb-sftp-pkey Airflow variable) + PGP decryption (cfsb-majority-pgp-private-key)
DAG: cfsb_sftp_download
Schedule: 15 9 * * * (daily at 09:15 UTC)
GCS bucket: cfsb-sftp-{env}
Processor: pgp

Reports:

| Report | SFTP path | Pattern |
|---|---|---|
| cfsb_transactions_reconciliation | cfsb_transactions_reconciliation/ | TXNDDA_MAJORITY_*.csv.pgp |

WebBank

Auth: SSH key (webbank-ssh-privatekey) + PGP decryption (webbank-pgp-privatekey)
DAG: webbank_sftp_download
Schedule: 0 14 * * * (daily at 14:00 UTC)
GCS bucket: webbank-sftp-{env}
Processor: bai2
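The bai2 processor handles WebBank's files in the BAI2 bank-reporting format, where each comma-delimited line starts with a two-digit record code and ends with a "/" terminator. A minimal sketch of splitting such a file into typed records (illustrative only; the actual processor does far more, e.g. continuation-record handling and amount parsing):

```python
# Standard BAI2 record codes and their meanings.
BAI2_RECORD_TYPES = {
    "01": "file_header",
    "02": "group_header",
    "03": "account_identifier",
    "16": "transaction_detail",
    "49": "account_trailer",
    "88": "continuation",
    "98": "group_trailer",
    "99": "file_trailer",
}

def parse_bai2_records(text: str) -> list[tuple[str, list[str]]]:
    """Split a BAI2 file into (record_type, fields) pairs.

    Each line is comma-delimited; the first field is the record code
    and the trailing '/' terminator is stripped before splitting.
    """
    records = []
    for line in text.splitlines():
        line = line.strip().rstrip("/")
        if not line:
            continue
        code, *fields = line.split(",")
        records.append((BAI2_RECORD_TYPES.get(code, "unknown"), fields))
    return records
```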

Majority Ledger

Source: Azure Blob Storage (not SFTP)
DAG: majority_ledger_blob_download
Schedule: 0 * * * * (hourly)
GCS bucket: majority-ledger-blob-{env}
Processor: raw

Downloads LedgerTransactionsHistory/*.csv from Azure Blob Storage (prodmajorityreporting / stagemajorityreporting storage accounts, reports container).


ATM All Points (deprecated)

Deprecated

This pipeline was migrated but is unused: dbt references were removed ~18 months ago, yet the DAG still runs weekly.

Auth: Password (atm-ftp-username, atm-ftp-password Airflow variables)
DAG: atm_all_points_sftp_download
Schedule: 0 19 * * 0 (Sundays at 19:00 UTC)
GCS bucket: atm-all-points-sftp-{env}
Processor: raw

Reports:

| Report | SFTP path | Pattern |
|---|---|---|
| allpoint_geo_tid_all | / | allpoint_geo_tid_all.csv |

The dev environment cannot connect to this SFTP server, so the pipeline is only testable in prod.


AWS (legacy)

Stale but not yet deleted

The following S3 buckets, previously used for file ingestion, are now stale; active ingestion targets GCS.

| Provider | S3 bucket(s) | Status |
|---|---|---|
| Checkout | checkout-majority-{env}, psp-funding-reconciliation-majority-{env} | To be deleted |
| InComm | incomm-report-majority-{env} | To be deleted |
| ATM All Points | atm-allpoints-majority-{env} | To be deleted |

How it works

```mermaid
flowchart LR
    SFTP["SFTP Server"] -->|dt-file-ingestion| GCS["GCS Bucket"]
    GCS -->|PubSub notification| CF["dt-gcp-bq-ingestion\ncloud function"]
    CF --> BQ["BigQuery"]
```
  1. Airflow triggers a KubernetesPodOperatorWithCredentials running the dt-file-ingestion image
  2. The image connects to the provider's SFTP server and downloads files matching configured patterns
  3. Files are uploaded to a provider-specific GCS bucket
  4. A PubSub notification (gcs-file-upload-topic) triggers the dt-gcp-bq-ingestion-cloud-function
  5. The cloud function loads the file into BigQuery, adding metadata columns: ingested_at, file_name, bucket_name
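The metadata enrichment in step 5 can be sketched as a pure function. Only the three column names come from the source; the function name and row shape are illustrative assumptions, not the cloud function's actual code.

```python
from datetime import datetime, timezone

def add_metadata(rows: list[dict], file_name: str, bucket_name: str) -> list[dict]:
    """Attach the metadata columns the cloud function adds before the
    BigQuery load: ingested_at, file_name, bucket_name (sketch only)."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {**row,
         "ingested_at": ingested_at,
         "file_name": file_name,
         "bucket_name": bucket_name}
        for row in rows
    ]
```

Stamping ingested_at once per load (rather than per row) keeps every row from the same file consistent, which makes it easy to dedupe re-ingested files downstream.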