CTEs vs Temporary Tables in BigQuery: Which to Choose
A practical guide to understanding the differences between Common Table Expressions (CTEs) and temporary tables in BigQuery, helping you choose the right approach for your query optimization needs.
When writing complex queries in BigQuery, Google Cloud's enterprise data warehouse, you'll often need to break down logic into manageable pieces. Two common approaches for structuring multi-step queries are Common Table Expressions (CTEs) and temporary tables. Understanding when to use CTEs vs temporary tables in BigQuery can significantly impact both your query performance and code maintainability.
The decision between these two approaches isn't always straightforward. Both let you organize query logic, but they differ fundamentally in how BigQuery processes them, how long they persist, and what kinds of optimization opportunities they create. For a climate research organization processing daily weather station readings or a streaming service analyzing viewing patterns across millions of subscribers, making the right choice affects query costs, execution time, and how easily your team can maintain the analytics pipeline.
Understanding CTEs in BigQuery
A Common Table Expression is a named subquery that you define within a single query statement using the WITH clause. Think of it as a named result set that exists only for the duration of that specific query. When you write a CTE in BigQuery, you're essentially creating a reference point within your SQL that makes complex logic more readable.
Here's a practical example from a solar energy company tracking panel performance across multiple installations:
WITH daily_output AS (
  SELECT
    installation_id,
    DATE(timestamp) AS output_date,
    SUM(kwh_generated) AS total_kwh,
    AVG(panel_temperature) AS avg_temp
  FROM solar_readings
  WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  GROUP BY installation_id, output_date
),
performance_baseline AS (
  SELECT
    installation_id,
    AVG(total_kwh) AS expected_kwh,
    STDDEV(total_kwh) AS kwh_variance
  FROM daily_output
  GROUP BY installation_id
)
SELECT
  d.installation_id,
  d.output_date,
  d.total_kwh,
  b.expected_kwh,
  (d.total_kwh - b.expected_kwh) / b.expected_kwh AS performance_deviation
FROM daily_output d
JOIN performance_baseline b ON d.installation_id = b.installation_id
WHERE ABS(d.total_kwh - b.expected_kwh) / b.expected_kwh > 0.15
ORDER BY d.output_date DESC, d.installation_id;
In this query, the two CTEs break down the logic clearly. The first aggregates daily output, and the second calculates baseline expectations. BigQuery's optimizer can see the entire query structure and make decisions about how to execute it most efficiently.
How Temporary Tables Work in BigQuery
Temporary tables in BigQuery are actual tables that BigQuery materializes to storage. A table created with CREATE TEMP TABLE is visible only within the multi-statement query (script) or session that created it and is dropped when that script or session ends; separately, BigQuery keeps every query's results in an anonymous temporary table for up to 24 hours. Either way, BigQuery writes the results to storage, and subsequent queries read from that materialized result. This approach physically separates query stages rather than keeping everything within a single statement.
Consider a genomics laboratory processing DNA sequencing data. Running the following statements together as one multi-statement query, they might use temporary tables like this:
CREATE TEMP TABLE sequence_variants AS
SELECT
  sample_id,
  chromosome,
  position,
  reference_base,
  alternate_base,
  quality_score
FROM raw_sequencing_data
WHERE quality_score > 30
  AND coverage_depth >= 10;

CREATE TEMP TABLE high_confidence_mutations AS
SELECT
  sv.sample_id,
  sv.chromosome,
  sv.position,
  sv.alternate_base,
  gd.gene_name,
  gd.clinical_significance
FROM sequence_variants sv
JOIN gene_definitions gd
  ON sv.chromosome = gd.chromosome
  AND sv.position BETWEEN gd.start_position AND gd.end_position
WHERE gd.clinical_significance IS NOT NULL;

SELECT
  gene_name,
  COUNT(DISTINCT sample_id) AS affected_samples,
  ARRAY_AGG(STRUCT(sample_id, position, alternate_base) ORDER BY sample_id LIMIT 10) AS example_mutations
FROM high_confidence_mutations
GROUP BY gene_name
HAVING COUNT(DISTINCT sample_id) >= 3
ORDER BY affected_samples DESC;
Each temporary table materializes its results. If you need to query these intermediate results multiple times or want to inspect them separately for debugging, this approach provides clear separation points.
Query Optimization and Performance Characteristics
The way BigQuery's query optimizer handles CTEs versus temporary tables differs in important ways that affect performance and cost. When you use CTEs, the BigQuery optimizer sees your entire query as a single unit. This visibility allows it to perform optimizations like predicate pushdown, where filter conditions from your final SELECT can be pushed down into earlier CTE definitions to reduce the data processed at each stage.
For example, if a payment processor is analyzing transaction patterns, a CTE that aggregates millions of transactions might only need to process a subset if the final query filters to specific merchant categories. BigQuery can recognize this and optimize accordingly.
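As a minimal sketch of predicate pushdown (the transactions table and its columns are hypothetical), a filter written only in the outer query can still reduce what the CTE scans:

```sql
-- Hypothetical schema: transactions(merchant_category, amount)
WITH category_totals AS (
  SELECT merchant_category, SUM(amount) AS total_amount
  FROM transactions
  GROUP BY merchant_category
)
SELECT merchant_category, total_amount
FROM category_totals
WHERE merchant_category IN ('grocery', 'fuel');
-- The optimizer can push the category filter down into the scan of
-- `transactions`, so the aggregation only touches rows for those categories.
```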
However, there's an important consideration with CTEs in BigQuery. Unlike some other database systems, BigQuery may execute a CTE multiple times if you reference it in multiple places within your query. If you have a computationally expensive CTE that's referenced three times in subsequent joins, BigQuery might recalculate it three times rather than computing it once and reusing the result.
Temporary tables avoid this issue because they materialize results exactly once. When a pharmaceutical company is joining clinical trial data across multiple dimensions, a temporary table ensures that a complex aggregation runs only once, even if that result feeds into several downstream joins or unions.
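A side-by-side sketch of the two approaches, using a hypothetical trial_results table, shows the difference in where the expensive aggregation runs:

```sql
-- CTE version: `site_stats` is referenced twice, so the aggregation
-- over trial_results may be executed twice.
WITH site_stats AS (
  SELECT site_id, AVG(response_score) AS avg_score
  FROM trial_results
  GROUP BY site_id
)
SELECT s.site_id, s.avg_score, o.overall_avg
FROM site_stats s
CROSS JOIN (SELECT AVG(avg_score) AS overall_avg FROM site_stats) o;

-- Temp table version: the aggregation runs exactly once, and both
-- references below read the materialized result.
CREATE TEMP TABLE site_stats AS
SELECT site_id, AVG(response_score) AS avg_score
FROM trial_results
GROUP BY site_id;

SELECT s.site_id, s.avg_score, o.overall_avg
FROM site_stats s
CROSS JOIN (SELECT AVG(avg_score) AS overall_avg FROM site_stats) o;
```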
When CTEs Make the Most Sense
CTEs excel when you're writing queries that follow a logical progression of transformations and you need each intermediate step only once. They shine in scenarios where readability and maintainability matter more than the absolute fastest execution time, and where the query optimizer benefits from seeing the full picture.
A mobile game studio analyzing player progression might use CTEs to track how players move through game levels. The query breaks down logically: first identify session boundaries, then calculate level completion rates, then compare against cohort averages. Each step feeds cleanly into the next, and the single query structure makes the logic easy to review during code reviews.
CTEs work particularly well when your intermediate results are relatively small or when filter conditions in later stages can significantly reduce the data processed earlier. A hospital network analyzing appointment scheduling efficiency might aggregate thousands of appointments into department-level statistics, then filter to departments showing unusual patterns. The optimizer can push those filters down effectively.
For ad-hoc analysis where you're iterating on query logic, CTEs provide faster development cycles. You can modify one part of the WITH clause without restructuring multiple CREATE TABLE statements, making exploration more fluid.
When Temporary Tables Are the Better Choice
Temporary tables become valuable when you need to reference the same intermediate result multiple times or when you're working with very large intermediate results that would be expensive to recalculate. They're also useful when you want to inspect intermediate steps separately, either for debugging or for quality checks between stages.
A freight logistics company processing GPS tracking data from thousands of trucks might create a temporary table of cleaned, geocoded positions first. This preprocessing step might involve complex spatial calculations and data quality filters. Once materialized, multiple downstream analyses can use these cleaned positions without repeating the expensive geocoding operations.
When working with massive datasets where intermediate results are large and used multiple times, temporary tables prevent redundant computation. A telecommunications company analyzing network traffic patterns might aggregate petabytes of connection logs down to hourly summaries by cell tower. If that summary feeds into three different analyses about coverage quality, capacity planning, and anomaly detection, materializing it once as a temporary table makes more sense than letting a CTE recalculate it three times.
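The telecommunications scenario might look roughly like this, with hypothetical table and column names; the summary is materialized once and each analysis reads the much smaller result:

```sql
-- Aggregate the raw logs once.
CREATE TEMP TABLE hourly_tower_summary AS
SELECT
  tower_id,
  TIMESTAMP_TRUNC(connected_at, HOUR) AS hour,
  COUNT(*) AS connections,
  AVG(signal_strength) AS avg_signal
FROM connection_logs
GROUP BY tower_id, hour;

-- Coverage quality: average signal per tower.
SELECT tower_id, AVG(avg_signal) AS mean_signal
FROM hourly_tower_summary
GROUP BY tower_id;

-- Capacity planning: total load per hour.
SELECT hour, SUM(connections) AS total_connections
FROM hourly_tower_summary
GROUP BY hour;

-- Anomaly detection: hours far above a tower's typical load.
SELECT tower_id, hour, connections
FROM hourly_tower_summary
QUALIFY connections > 3 * AVG(connections) OVER (PARTITION BY tower_id);
```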
Temporary tables also provide clear checkpoints in complex, multi-stage processes. If you're running a lengthy data transformation pipeline and something fails partway through, you can inspect which temporary tables were created successfully and troubleshoot from there. A research institution processing satellite imagery might create temporary tables for each processing stage, making it easier to validate that each transformation produces expected results before proceeding.
Memory and Resource Considerations in Google Cloud
BigQuery processes queries using a distributed architecture across Google Cloud infrastructure. CTEs consume resources differently than temporary tables because they exist only in the context of query execution. When you use CTEs, BigQuery allocates memory and compute slots for the entire query, and those resources are released when the query completes.
Temporary tables, by contrast, write data to BigQuery storage. Script-scoped temporary tables are dropped when the script or session ends, and Google does not bill for their storage, but materializing intermediate results still costs the compute to write them and the bytes processed when downstream stages read them back. For a video streaming platform generating massive daily reports, this distinction matters: if intermediate results are several terabytes, every materialization and subsequent scan of those terabytes shows up in query cost and runtime.
The choice also affects slot utilization. A single query with CTEs uses slots continuously until completion. Breaking work into temporary tables means each CREATE TEMP TABLE statement is a separate job that uses slots and then releases them. For organizations with reserved slot commitments in GCP, this can affect how effectively you utilize your purchased capacity.
Practical Patterns and Best Practices
Many production BigQuery workflows in Google Cloud combine both approaches strategically. A common pattern is using CTEs for queries with linear, sequential logic where each step naturally feeds into the next, while using temporary tables when intermediate results branch into multiple independent analyses.
Consider an online learning platform analyzing student engagement. They might use a temporary table to create a cleaned, sessionized view of activity logs that multiple departments query. The marketing team writes queries with CTEs analyzing signup funnels, while the product team uses CTEs to explore feature usage patterns, both starting from the same temporary table of sessionized data.
When debugging complex queries, you can temporarily convert CTEs to temp tables to inspect intermediate results. This technique helps verify that each transformation step produces expected output. Once validated, you might convert some back to CTEs for cleaner production code.
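For example, a CTE under suspicion can be lifted out into a temp table so its rows can be examined directly (table and column names here are hypothetical):

```sql
-- During debugging: materialize what was previously
-- `WITH sessionized AS (...)` so it can be inspected.
CREATE TEMP TABLE sessionized AS
SELECT user_id, session_id, MIN(event_time) AS session_start
FROM activity_logs
GROUP BY user_id, session_id;

-- Sanity checks on the intermediate result.
SELECT COUNT(*) AS row_count, COUNT(DISTINCT user_id) AS distinct_users
FROM sessionized;
```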
For scheduled queries running regularly in BigQuery, consider whether intermediate results have value beyond a single query execution. A retail chain running nightly inventory analysis might benefit from temporary tables that persist long enough for multiple stakeholders to run their own queries against standardized intermediate datasets.
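Because CREATE TEMP TABLE results are visible only within the creating script or session, sharing intermediate results across stakeholders is usually done with a regular table that carries an expiration. A sketch with hypothetical dataset and table names:

```sql
-- A shared intermediate dataset that cleans itself up after a day.
CREATE OR REPLACE TABLE analytics_staging.nightly_inventory_summary
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
) AS
SELECT store_id, sku, SUM(quantity) AS on_hand
FROM inventory_events
GROUP BY store_id, sku;
```

Unlike a true temporary table, this table is queryable by anyone with access to the dataset until it expires, which fits the multi-stakeholder pattern described above.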
Testing and Development Workflow
The development experience differs between CTEs and temporary tables in ways that affect productivity. When building complex analytical queries in the BigQuery console or with the bq command-line tool, CTEs let you iterate quickly. You can modify the final SELECT statement and rerun the query to see how results change, knowing that all the logic is contained in one place.
Temporary tables require a more deliberate approach. You execute each CREATE TEMP TABLE statement, verify the results, then build subsequent queries. This process can be more methodical but also more time-consuming during initial development. For a public health department building queries to analyze COVID testing data during rapidly evolving requirements, the agility of CTEs might outweigh other considerations during initial development.
However, temporary tables can speed up iteration when you're refining later stages of a pipeline. Once you've created temporary tables for expensive early stages, you can modify and rerun later queries without reprocessing the entire dataset. A scientific computing lab processing telescope observations might create temporary tables for the computationally intensive image processing steps, then iterate quickly on the statistical analysis queries that follow.
Certification and Professional Development Context
Understanding when to use CTEs versus temporary tables in BigQuery is relevant to the Professional Data Engineer certification from Google Cloud. The exam covers BigQuery query optimization, data warehouse design patterns, and practical considerations for building efficient data pipelines in GCP. This topic also appears in the context of the Professional Cloud Architect certification when designing analytics solutions that balance performance, cost, and maintainability.
Making the Decision in Practice
The choice between CTEs and temporary tables in BigQuery ultimately depends on your specific requirements. Use CTEs when you want readable, maintainable queries for linear transformations where intermediate steps are used once. They work well for ad-hoc analysis, reports with straightforward logic, and situations where BigQuery's query optimizer can benefit from seeing the full query structure.
Choose temporary tables when intermediate results are referenced multiple times, when you're working with very large datasets where materialization prevents redundant computation, or when you need to inspect intermediate steps separately. They're valuable for complex ETL processes, multi-stage transformations with branching logic, and scenarios where different teams query the same intermediate results.
In many real-world scenarios on Google Cloud, the best approach combines both techniques. A transportation authority analyzing subway ridership might use temporary tables to create core datasets of cleaned turnstile readings, then use CTEs within specific analytical queries to explore passenger flow patterns, peak usage times, or service disruption impacts. This hybrid approach balances the strengths of each technique while avoiding their limitations.
As you design queries and data pipelines in BigQuery, consider not just immediate execution requirements but also how your code will be maintained, debugged, and evolved. The most elegant solution considers both technical performance characteristics and the practical realities of how your team works with data in the Google Cloud environment.