Implementing Data Quality Checks with Dataform Assertions

A hands-on guide to implementing data quality checks using Dataform assertions in Google Cloud BigQuery, covering validation rules, null checks, and automated data integrity testing.

Implementing data quality checks with Dataform assertions is a critical skill for managing data pipelines in Google Cloud Platform. This tutorial walks you through creating automated validation rules that catch data integrity issues early in your transformation process. You'll learn how to configure assertions that check for null values, ensure data completeness, and validate critical fields before data moves downstream in your pipeline.

Data quality problems can cascade through your entire analytics infrastructure. They lead to incorrect reports, failed downstream processes, and lost trust in your data systems. Dataform assertions provide an automated safety net that validates your data against defined rules as part of the transformation workflow. This capability matters for Professional Data Engineer exam preparation, as data quality and pipeline reliability are key topics covered in the certification.

Why Dataform Assertions Matter for Data Quality

Dataform is a development framework built on top of Google Cloud BigQuery that helps you manage SQL workflows and data transformations. Assertions in Dataform are specialized checks that automatically validate data quality conditions during the transformation process. Instead of manually writing separate validation queries or discovering data issues in production, assertions let you define quality rules directly in your transformation code.

When an assertion fails, Dataform immediately alerts you to the problem. You can address data integrity issues before they affect downstream consumers. This proactive approach to data quality works better than reactive troubleshooting after reports have already been generated with bad data.

Prerequisites and Requirements

Before you begin implementing data quality checks with Dataform assertions, ensure you have a Google Cloud Platform account with an active project, access to BigQuery, and permissions to create datasets and tables. You'll need Dataform enabled in your GCP project and basic familiarity with SQL and data transformation concepts. Make sure you have a source table in BigQuery that you want to validate. Set aside approximately 30 minutes to complete this tutorial.

You'll need the BigQuery Data Editor role and permissions to create Dataform repositories. If you haven't used Dataform before, you can enable it through the Google Cloud Console by navigating to the Dataform section under Analytics services.

Understanding Dataform Assertions

Dataform assertions are validation rules that check your data against specific conditions. They run automatically as part of your data transformation workflow and can validate various data quality requirements. You can perform non-null validation to ensure critical fields are populated, run uniqueness checks to verify that identifier columns contain no duplicates, and define custom SQL conditions for complex business rules. You can also validate row count thresholds and check referential integrity between tables.

Assertions are defined in your Dataform configuration files using SQLX syntax, which combines SQL with configuration directives. When Dataform executes your workflow, it runs these assertions and reports any failures, preventing bad data from propagating through your pipeline.

Step 1: Set Up Your Dataform Repository

First, create a new Dataform repository in the Google Cloud Console. Navigate to Dataform in your GCP project and click Create Repository. Give your repository a meaningful name such as "data-quality-pipeline" and select your BigQuery region.

Once created, initialize a development workspace where you'll write your transformation code and assertions. Dataform uses Git for version control, so your configurations are tracked and can be reviewed through standard development workflows.

Create a new file in your repository called definitions/customer_transform.sqlx. This file will contain both your transformation logic and data quality assertions.

Step 2: Create a Basic Transformation with Assertions

Let's build a transformation that processes customer data with built-in quality checks. In your customer_transform.sqlx file, add the following configuration:

config {
  type: "table",
  schema: "analytics",
  name: "customers_validated",
  assertions: {
    nonNull: ["customer_id", "email", "created_date"],
    uniqueKey: ["customer_id"]
  }
}

SELECT
  customer_id,
  email,
  first_name,
  last_name,
  created_date,
  last_purchase_date,
  total_lifetime_value
FROM
  ${ref("raw_customers")}
WHERE
  is_deleted = false

This configuration accomplishes several things. The type: "table" directive tells Dataform to create a BigQuery table. The assertions block defines two quality checks: nonNull ensures that customer_id, email, and created_date columns never contain null values, while uniqueKey verifies that customer_id is unique across all rows.

The ${ref("raw_customers")} function creates a dependency on your source table, ensuring that Dataform executes transformations in the correct order.
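If raw_customers isn't produced by another Dataform action, declare it as an external source so that ${ref()} can resolve it. A minimal sketch, assuming the raw table lives in a dataset named raw_data (adjust the schema and name to match your project), placed in a file such as definitions/sources/raw_customers.sqlx:

config {
  type: "declaration",
  schema: "raw_data",
  name: "raw_customers"
}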

Step 3: Implementing Non-Null Validation

Non-null assertions are among the most common data quality checks. They ensure that critical fields always contain values. For a subscription box service handling monthly deliveries, missing customer identifiers or contact information could prevent order fulfillment and customer communication.

The nonNull assertion checks every row in your transformed table. If any row contains a null value in the specified columns, the assertion fails and Dataform reports the validation error. This prevents downstream processes from consuming incomplete data.

You can double-check the rule with a standalone assertion file that searches for the same violations (a pattern covered in more depth in Step 5). Create a file called definitions/assertions/customer_not_null_check.sqlx:

config {
  type: "assertion",
  name: "customer_not_null_check"
}

SELECT
  customer_id,
  email,
  created_date
FROM
  ${ref("customers_validated")}
WHERE
  customer_id IS NULL
  OR email IS NULL
  OR created_date IS NULL

This assertion query should return zero rows when the data is clean. Any rows it returns represent the same violations that trigger the built-in nonNull assertion.

Step 4: Adding Uniqueness Constraints

Uniqueness assertions validate that specified columns contain no duplicate values. For identifier columns like customer_id or order_id, duplicates indicate serious data integrity problems that could lead to incorrect aggregations or missing records.

The uniqueKey assertion in Dataform checks that the combination of specified columns is unique. For a single column, you simply provide it in an array. For composite keys, you can specify multiple columns:

config {
  type: "table",
  schema: "analytics",
  name: "order_items_validated",
  assertions: {
    nonNull: ["order_id", "item_id", "quantity"],
    uniqueKey: ["order_id", "item_id"]
  }
}

SELECT
  order_id,
  item_id,
  product_name,
  quantity,
  unit_price,
  total_price
FROM
  ${ref("raw_order_items")}

This configuration ensures that each combination of order_id and item_id appears only once in the table, which is appropriate for an order items table where each product should appear once per order.
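Under the hood, a uniqueness check amounts to a grouped duplicate search. The following standalone query illustrates the idea (it isn't the exact SQL Dataform generates); replace the table reference with your own project and dataset:

SELECT
  order_id,
  item_id,
  COUNT(*) AS occurrences
FROM
  `your-project.analytics.order_items_validated`
GROUP BY
  order_id,
  item_id
HAVING
  COUNT(*) > 1

If this query returns any rows, the uniqueKey assertion on those columns would fail.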

Step 5: Creating Custom Assertion Queries

Beyond the built-in assertion types, Dataform allows you to write custom SQL assertions for business-specific validation rules. These custom assertions give you complete flexibility to implement complex data quality checks.

Create a new file called definitions/assertions/customer_validation.sqlx:

config {
  type: "assertion",
  name: "valid_email_format"
}

SELECT
  customer_id,
  email
FROM
  ${ref("customers_validated")}
WHERE
  NOT REGEXP_CONTAINS(email, r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')

This custom assertion validates email format using a regular expression. If the query returns any rows, the assertion fails because those rows contain invalid email addresses. Custom assertions should always return zero rows when data quality standards are met.

For a telehealth platform handling patient appointment scheduling, you might create an assertion ensuring appointment times fall within business hours:

config {
  type: "assertion",
  name: "appointments_within_business_hours"
}

SELECT
  appointment_id,
  scheduled_time,
  provider_id
FROM
  ${ref("appointments_validated")}
WHERE
  EXTRACT(HOUR FROM scheduled_time) < 8
  OR EXTRACT(HOUR FROM scheduled_time) >= 18
  OR EXTRACT(DAYOFWEEK FROM scheduled_time) IN (1, 7)

This assertion fails if any appointments are scheduled outside of 8 AM to 6 PM on weekdays, catching configuration errors or data entry mistakes.
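Custom assertions also cover the referential integrity checks mentioned earlier. As a sketch, assuming a hypothetical orders_validated table whose customer_id values should all exist in customers_validated, this assertion surfaces orphaned orders:

config {
  type: "assertion",
  name: "orders_reference_known_customers"
}

SELECT
  o.order_id,
  o.customer_id
FROM
  ${ref("orders_validated")} AS o
LEFT JOIN
  ${ref("customers_validated")} AS c
ON
  o.customer_id = c.customer_id
WHERE
  c.customer_id IS NULL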

Step 6: Execute and Monitor Your Assertions

After defining your transformations and assertions, you need to execute them in Google Cloud. In the Dataform interface, navigate to your workflow and click the "Start Execution" button. Dataform will compile your SQLX files into SQL queries, create the necessary BigQuery tables, and run all defined assertions.

During execution, Dataform displays real-time status for each action. Successful assertions show a green checkmark, while failures surface an error message reporting how many rows violated the rule. Because each assertion is materialized as a view in BigQuery, you can query that view to inspect the offending rows and quickly assess the scope of the data quality issue.

You can also execute Dataform workflows programmatically using the command line. Install the Dataform CLI and authenticate with your GCP project, then run:

dataform run --tags assertion

This command executes only the actions tagged assertion, letting you validate data quality without running the full transformation pipeline.
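The assertion tag isn't applied automatically, so this approach assumes you've added a matching tag to each assertion file yourself, for example:

config {
  type: "assertion",
  name: "valid_email_format",
  tags: ["assertion"]
}

Tags work on tables and other action types too, so you can define whatever groupings suit your workflow.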

Verification and Testing

To verify that your assertions are working correctly, you should test both success and failure scenarios. Start by confirming that your assertions pass with clean data. Execute your workflow and check that all assertions complete successfully.

Next, intentionally introduce data quality issues to verify that assertions catch them. You can do this by creating a test source table with known problems. For example, create a temporary table with null values in critical columns:

CREATE OR REPLACE TABLE `your-project.raw_data.test_customers` AS
SELECT
  1 AS customer_id,
  'test@example.com' AS email,
  CURRENT_DATE() AS created_date
UNION ALL
SELECT
  2 AS customer_id,
  NULL AS email,
  CURRENT_DATE() AS created_date;

Point your Dataform transformation at this test table and run the workflow. The nonNull assertion should fail, and querying the assertion's view will show the row with the missing email. This confirms that your quality checks are functioning as intended.

Real-World Application Examples

Data quality checks with Dataform assertions solve practical problems across various industries. Consider these real-world scenarios.

Smart Building Sensor Network

A smart building management company collects temperature, humidity, and occupancy data from thousands of sensors. They use Dataform assertions to validate sensor readings before feeding them into HVAC control systems. Their assertions check that temperature readings fall within physically possible ranges, sensor IDs match registered devices, and timestamps are recent enough to be actionable. This prevents faulty sensor data from triggering incorrect climate control decisions that could waste energy or compromise comfort.

Mobile Game Analytics Platform

A mobile game studio processes millions of player events daily to track engagement and monetization. They implement Dataform assertions to ensure player_id is never null, event timestamps are within the last 48 hours, and in-app purchase amounts are positive numbers. These checks caught a bug where some events were being logged with incorrect timestamps, which would have skewed their retention analysis and led to incorrect product decisions.

Agricultural Monitoring System

A precision agriculture platform analyzes soil moisture, weather conditions, and crop health data from farms across multiple regions. Their Dataform pipeline includes assertions validating that GPS coordinates fall within expected geographic boundaries, moisture readings are within sensor specifications, and each field has at least one measurement per day. When a sensor malfunction caused missing data for several fields, the assertion failures immediately alerted the operations team, who dispatched technicians before the data gap affected irrigation scheduling recommendations.

Common Issues and Troubleshooting

When implementing data quality checks with Dataform assertions, you may encounter several common issues. Understanding how to diagnose and resolve these problems will help you maintain reliable data pipelines.

Assertion Failures in Production

If assertions that passed in development suddenly fail in production, the root cause is usually upstream data changes. Check your source tables for schema modifications, null values introduced by new data sources, or changes in data collection processes. Review the BigQuery logs to identify when the data quality degraded.

Performance Issues with Complex Assertions

Custom assertions that scan large tables can slow down your pipeline execution. If assertions take too long, consider adding WHERE clauses to sample recent data, using partitioned tables to limit scan scope, or moving expensive validation checks to scheduled queries that run independently. Keep in mind that assertion queries add to each run's workload, and any downstream actions configured to depend on them will wait until they complete.

False Positive Assertion Failures

Sometimes assertions fail because the validation logic is too strict rather than because data quality is actually poor. For example, an email validation regex might reject valid international domains or newer top-level domains. Review failed rows carefully before assuming they represent real problems. You may need to adjust your assertion logic to accommodate legitimate edge cases.

Integration with Other GCP Services

Dataform assertions integrate naturally with the broader Google Cloud ecosystem. When assertions fail, you can configure Cloud Monitoring alerts to notify your team through email, Slack, or PagerDuty. Set up log-based metrics that track assertion failure rates over time, giving you visibility into data quality trends.

BigQuery itself provides the foundation for Dataform assertions. The validation queries execute as standard BigQuery SQL, so they benefit from BigQuery's performance optimizations and can use features like partitioned tables and clustered columns. You can query BigQuery's INFORMATION_SCHEMA to analyze assertion execution history and failure patterns.

For more complex workflows, combine Dataform with Cloud Composer (managed Apache Airflow) to orchestrate multi-step pipelines. Cloud Composer can trigger Dataform workflow executions, monitor assertion results, and branch to different downstream processes based on data quality outcomes. This integration allows you to implement sophisticated data quality gates where clean data flows to production analytics while problematic data routes to investigation and remediation workflows.

Cloud Data Fusion can feed data into tables that Dataform validates, creating end-to-end pipelines from data ingestion through transformation and quality validation. Dataflow jobs can also populate source tables, with Dataform assertions providing the quality checkpoint before data reaches your analytics layer.

Best Practices and Recommendations

When implementing data quality checks with Dataform assertions in production environments, follow these recommendations to ensure reliable and maintainable pipelines.

Start with assertions on critical fields only. Don't attempt to validate every column in every table immediately. Focus on identifiers, foreign keys, and business-critical fields that would cause significant problems if corrupted. You can gradually expand your assertion coverage as your team gains experience with the framework.

Document your assertion logic clearly. Use descriptive names for custom assertions and include comments explaining the business rationale behind each validation rule. Future team members need to understand what each assertion checks and why that check matters for data quality.

Monitor assertion failure rates over time. A sudden increase in failures might indicate upstream system problems, while a gradual increase could signal data drift or changing business processes. Set up dashboards that track assertion outcomes as key data quality metrics.

Consider assertion performance when designing validations. Full table scans on large datasets can significantly slow your pipeline. Use partitioning and clustering on your BigQuery tables to improve assertion query performance. For tables with billions of rows, consider sampling-based assertions that validate a representative subset rather than every record.
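One way to sample is BigQuery's TABLESAMPLE clause inside a custom assertion. A sketch, assuming a hypothetical events_validated table; keep in mind that a sampled check can miss violations in the rows it skips:

config {
  type: "assertion",
  name: "sampled_event_id_not_null"
}

SELECT
  event_id
FROM
  ${ref("events_validated")} TABLESAMPLE SYSTEM (1 PERCENT)
WHERE
  event_id IS NULL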

Implement a clear process for handling assertion failures. Define who gets notified, how quickly issues must be investigated, and whether failed assertions should block downstream processes or simply generate warnings. Different assertions may warrant different response procedures based on their criticality.

Next Steps and Enhancements

After implementing basic data quality checks with Dataform assertions, you can enhance your data validation capabilities in several ways. Explore row-level assertions that validate relationships between columns within the same row, such as ensuring that discount_amount never exceeds original_price or that end_date is always after start_date.
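Dataform's built-in rowConditions assertion is one way to express these row-level rules directly in a table's config. A sketch, assuming a hypothetical subscriptions source with discount and date-range columns; each listed condition must hold for every row:

config {
  type: "table",
  schema: "analytics",
  assertions: {
    rowConditions: [
      "discount_amount <= original_price",
      "end_date > start_date"
    ]
  }
}

SELECT
  subscription_id,
  original_price,
  discount_amount,
  start_date,
  end_date
FROM
  ${ref("raw_subscriptions")}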

Investigate incremental models in Dataform that process only new or changed data. Incremental processing significantly improves performance for large datasets, and assertions can validate just the incremental portion while still ensuring overall data quality.
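A minimal sketch of an incremental version of the earlier customer transformation, assuming created_date is a reliable watermark column; the when() and incremental() helpers are part of Dataform's SQLX syntax, and the assertions block works the same way as on a regular table:

config {
  type: "incremental",
  schema: "analytics",
  assertions: {
    nonNull: ["customer_id", "email", "created_date"]
  }
}

SELECT
  customer_id,
  email,
  created_date
FROM
  ${ref("raw_customers")}
${when(incremental(), `WHERE created_date > (SELECT MAX(created_date) FROM ${self()})`)}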

Consider implementing assertion thresholds for cases where some data quality issues are acceptable. Instead of failing when any rows violate a rule, you might allow up to 0.1% of rows to fail before triggering an alert. This approach works well for assertions checking data from unreliable external sources where perfect quality is unattainable.
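One way to implement a tolerance like this is a custom assertion that aggregates first and only returns a row when the violation rate crosses the threshold. A sketch, assuming a hypothetical external_leads source where a small share of missing emails is acceptable:

config {
  type: "assertion",
  name: "lead_email_null_rate_within_tolerance"
}

SELECT
  COUNT(*) AS total_rows,
  COUNTIF(email IS NULL) AS missing_emails
FROM
  ${ref("external_leads")}
HAVING
  SAFE_DIVIDE(COUNTIF(email IS NULL), COUNT(*)) > 0.001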

Explore the Dataform documentation on advanced features like pre and post operations, which let you run SQL before or after table creation. You can use these hooks to collect data quality metrics, archive failed rows for analysis, or trigger notifications through custom integrations.

Monitoring and Maintaining Data Quality

Ongoing monitoring ensures that your data quality checks remain effective as your data pipeline evolves. Set up BigQuery scheduled queries that analyze assertion execution history and generate daily data quality reports. Track metrics like total assertions executed, failure rates by assertion type, and trends over time.
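As a starting point, the following query sketches one way to summarize recent assertion activity from BigQuery's job history. It assumes assertions are created in the default dataform_assertions dataset and uses the query text as a rough filter, so adjust the region qualifier, lookback window, and filter to match your project:

SELECT
  DATE(creation_time) AS run_date,
  COUNTIF(error_result IS NULL) AS succeeded,
  COUNTIF(error_result IS NOT NULL) AS failed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND query LIKE '%dataform_assertions%'
GROUP BY
  run_date
ORDER BY
  run_date DESC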

Create dashboards in Looker or Looker Studio that visualize data quality metrics for stakeholders. Show the number of records processed, percentage passing quality checks, and common failure patterns. This visibility helps build organizational awareness of data quality and demonstrates the value of your validation infrastructure.

Regularly review and update your assertions as business requirements change. Schedule quarterly reviews where you assess whether existing assertions still provide value, identify new quality checks needed for recently added data sources, and remove obsolete validations that no longer apply.

Summary

You've now learned how to implement data quality checks with Dataform assertions in Google Cloud Platform. Starting with simple non-null and uniqueness validations, you progressed to custom SQL assertions that enforce complex business rules. You created transformations with embedded quality checks, tested assertion logic with intentional failures, and explored how these validations prevent data integrity issues from cascading through your analytics pipeline.

The skills you've built apply directly to building production data pipelines where automated quality validation is essential. You understand how assertions integrate with BigQuery, how to troubleshoot common issues, and how to design validation rules that balance thoroughness with performance. These capabilities are fundamental for anyone working with data transformation workflows in GCP.

For those preparing for the Professional Data Engineer certification, data quality and pipeline reliability are key exam topics. The practical experience gained from implementing Dataform assertions will help you understand the design patterns and best practices that the exam tests. Readers looking for comprehensive exam preparation can check out the Professional Data Engineer course, which covers data quality, pipeline orchestration, and the full range of Google Cloud data services you'll need to master for certification success.