Data transformation is a crucial step in any data pipeline. dbt (data build tool) streamlines this process, but running dbt jobs in a scheduled, automated manner can be complex. This tech guide shows you how to use a powerful trio: Google Workflows, Cloud Scheduler, and Cloud Run. Discover how these services work together to execute dbt commands with dynamic parameter overrides.
Why This Setup?
This combination offers multiple advantages:
- Simplified Workflow Management: Google Workflows provides a visual interface for building and orchestrating multi-step workflows. Think of it as the conductor coordinating the entire process.
- Scheduled Execution: Cloud Scheduler triggers the workflow at predefined intervals, ensuring your dbt jobs run on time every time.
- Serverless Execution: Cloud Run offers a serverless platform to run containerised dbt tasks. It scales automatically and eliminates server management overhead.
- Dynamic Parameter Control: We’ll explore how to dynamically override dbt parameters within the Cloud Run job, adding flexibility to your workflows.
Let’s get familiar with the basic architecture below. You’ll discover how to use Cloud Workflows along with dbt, how to share parameters between them dynamically, and finally how to create a Cloud Scheduler job to trigger your workflow.
Building the Pipeline
Here’s a step-by-step breakdown of setting up this pipeline:
- Cloud Run Job Setup:
Follow our previous post to learn more about Cloud Run job creation. In this example, we will set up the job with the dbt run --select staging command.
- Workflow Creation:
In Google Workflows, create a new workflow with three simple steps: init, dbt_run, and finish.
Here’s where the magic happens: you define the placeholders that are unique to each run (see step 1 in the image below). This example inserts random data with audit columns so you can track the load datetime, job run (the workflow name), and run ID of each execution.
In step 2, you set up the variables in the format dbt expects for --vars and build the new dbt command (dbt_container_arg).
Important: the variable names referenced in your dbt SQL file must match the placeholders you set up in the workflow. For example:
,{{ var('_ctl_tsp_load_datetime_') }} AS ctl_tsp_load_datetime
Step 3 in the image shows how to override the dbt arguments: it simply replaces the Cloud Run job’s “Container arguments” with the newly built dbt command.
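To make those three steps concrete, here is a minimal sketch of what such a workflow definition could look like. It is not the exact workflow from the screenshots: the project, region, and job names and the job-run/run-ID variable names are assumptions, and it assumes the Cloud Run v2 connector (googleapis.run.v2.projects.locations.jobs.run) and its containerOverrides field for replacing the job’s arguments:

main:
  steps:
    - init:
        assign:
          - project_id: "my-project"                                       # placeholder
          - region: "europe-west1"                                         # placeholder
          - job_name: "dbt-job"                                            # Cloud Run job from the previous post
          - load_datetime: ${time.format(sys.now())}                       # load datetime for this run
          - job_run: ${sys.get_env("GOOGLE_CLOUD_WORKFLOW_ID")}            # workflow name
          - run_id: ${sys.get_env("GOOGLE_CLOUD_WORKFLOW_EXECUTION_ID")}   # execution ID
    - dbt_run:
        steps:
          - build_command:
              assign:
                # Serialise the placeholders into the JSON that dbt expects after --vars,
                # then build the full argument list (dbt_container_arg).
                - dbt_vars: ${json.encode_to_string({"_ctl_tsp_load_datetime_": load_datetime, "_ctl_nam_job_run_": job_run, "_ctl_cod_run_id_": run_id})}
                - dbt_container_arg: ${["run", "--select", "staging", "--vars", dbt_vars]}
          - run_job:
              # Override the job's "Container arguments" with the new dbt command.
              call: googleapis.run.v2.projects.locations.jobs.run
              args:
                name: ${"projects/" + project_id + "/locations/" + region + "/jobs/" + job_name}
                body:
                  overrides:
                    containerOverrides:
                      - args: ${dbt_container_arg}
              result: job_execution
    - finish:
        return: ${job_execution}

Inside the dbt model, each of these values is then read back with {{ var('...') }}, as in the ctl_tsp_load_datetime example above.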
Running the Pipeline
Once you have everything configured, you can run the pipeline! Here’s how:
- Workflow Execution: Trigger the workflow manually or let Cloud Scheduler handle it based on the defined schedule.
- Cloud Run Job Execution: The workflow triggers the Cloud Run job, which then executes the dbt command with the dynamically overridden parameters.
After the job succeeds, check your BigQuery results!
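For example, a quick query like the one below shows whether the audit columns picked up the values passed in by the workflow. Only ctl_tsp_load_datetime comes from the earlier example; the table and the other column names are purely illustrative:

SELECT
  ctl_tsp_load_datetime,   -- load timestamp passed via --vars
  ctl_nam_job_run,         -- illustrative name for the workflow-name column
  ctl_cod_run_id           -- illustrative name for the run-ID column
FROM `my-project.staging.my_model`  -- illustrative dataset and model
ORDER BY ctl_tsp_load_datetime DESC
LIMIT 10;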
Here’s an example of a manual execution:
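One way to do this is from the command line with gcloud; the workflow name and region below are placeholders:

# Execute the workflow and wait for it to finish, printing the result
gcloud workflows run dbt-run-workflow --location=europe-west1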
However, you can also create a Cloud Scheduler job to run your workflow on a recurring schedule.
Here is an example of creating a scheduler that runs daily at 8 AM and triggers the workflow:
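If you prefer the command line, an equivalent scheduler can be created with gcloud. The project, region, workflow, and service-account names below are placeholders; the target URI is the Workflow Executions API, and the service account needs permission to invoke the workflow:

# Daily at 8 AM: POST to the Workflow Executions API to start a new run
gcloud scheduler jobs create http dbt-daily-run \
  --location=europe-west1 \
  --schedule="0 8 * * *" \
  --http-method=POST \
  --uri="https://workflowexecutions.googleapis.com/v1/projects/my-project/locations/europe-west1/workflows/dbt-run-workflow/executions" \
  --oauth-service-account-email="workflow-invoker@my-project.iam.gserviceaccount.com"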
Your scheduler will then be ready to run, as shown below. You can also trigger it manually by selecting the scheduler and clicking “Force run”.
Note:
To run this pipeline successfully, you probably need to change your entrypoint.sh file to match the number of arguments in your dbt command.
You can find more details about this entrypoint in our previous article, “How to set up a Cloud Run Job to Read and Write on BigQuery”.
In this case, entrypoint.sh has to accept every token of the new dbt command and pass each one through to dbt.
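As a rough sketch (not the exact script from that article), an entrypoint that maps each container argument onto the dbt command could look like this:

#!/bin/bash
set -e
# The Cloud Run job receives the tokens built by the workflow, e.g.
# "run" "--select" "staging" "--vars" "{...}".
# The number of positional parameters here must match the number of tokens
# in your dbt command; alternatively, forward everything with: dbt "$@"
dbt "$1" "$2" "$3" "$4" "$5"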
Benefits and Use Cases
This approach offers several benefits:
- Automated dbt Execution: Schedules dbt runs for a truly hands-off approach.
- Dynamic Control: Adjusts dbt behaviour based on workflow parameters, making the pipeline adaptable.
- Serverless Efficiency: Uses serverless technologies for cost-effective and scalable execution.
This setup is ideal for various use cases, including:
- Daily Data Refresh: Schedule dbt jobs to update data models daily.
- Environment-Specific Transformations: Use workflow parameters to define different dbt configurations for development, staging, and production environments.
- Conditional Data Processing: Control dbt behaviour based on specific conditions passed through workflow parameters.
Conclusion
By combining Google Workflows, Cloud Scheduler, and Cloud Run, you can create a powerful and flexible system for running dbt jobs. This serverless approach simplifies workflow management, automates execution, and lets you control dbt parameters dynamically. With this setup, your data transformations run efficiently and reliably, helping your organisation make better data-driven decisions.