Avoiding Exploding Rows When Using UNNEST in BigQuery
UNNEST operations in BigQuery can unexpectedly multiply your row counts, leading to incorrect aggregations and poor query performance. This guide explains why this happens and how to structure your queries correctly.
When you start working with nested and repeated fields in BigQuery, UNNEST seems straightforward at first. You have an array column, you want to access the individual elements, so you use UNNEST to flatten it. But then something strange happens: your query returns far more rows than you expected, your aggregations produce wildly incorrect numbers, and suddenly a simple analysis becomes a debugging nightmare. This phenomenon, often called "exploding rows" or "row explosion," catches even experienced data analysts off guard when working with Google Cloud's data warehouse.
The challenge with avoiding exploding rows when using UNNEST in BigQuery isn't just about getting the syntax right. It's about understanding how UNNEST interacts with your data model, especially when you're dealing with multiple arrays in the same table. The consequences of getting this wrong extend beyond incorrect results. In a GCP environment where query costs are tied directly to bytes processed, an accidental Cartesian product from multiple UNNESTs can turn an inexpensive analytical query into a budget problem.
Why UNNEST Causes Row Explosion
The core issue stems from how UNNEST generates rows from array fields. When you UNNEST a single array column, BigQuery creates one output row for each element in that array. A table row with an array containing five elements becomes five rows in your result set. This behavior is expected and usually desired.
The problem emerges when you UNNEST multiple array columns from the same table row. Consider a solar farm monitoring system that tracks panel performance. Each monitoring station record might contain an array of temperature readings and a separate array of power output measurements, both captured at regular intervals. If you naively UNNEST both arrays in the same query, BigQuery creates the Cartesian product: every temperature reading gets paired with every power output measurement.
Here's what this looks like in practice:
SELECT
station_id,
temp,
power_output
FROM
solar_monitoring.stations,
UNNEST(temperature_readings) AS temp,
UNNEST(power_output_readings) AS power_output
If a single station record has 24 temperature readings and 24 power output readings, this query produces 576 rows (24 × 24) from that one source row. This explosion happens because BigQuery treats multiple UNNEST operations as separate joins, creating a cross product between the arrays.
The Correct Approach: Correlated UNNEST
The key insight for avoiding exploding rows when using UNNEST in BigQuery is recognizing when your arrays are meant to be processed together versus independently. Arrays that represent parallel collections (measurements taken at the same time intervals, related attributes of the same entities) need to be handled differently than independent collections.
For parallel arrays, BigQuery provides a solution: UNNEST multiple arrays together with positional correlation using the WITH OFFSET clause. This approach pairs array elements by their position rather than creating a Cartesian product.
SELECT
station_id,
temp,
power_output
FROM
solar_monitoring.stations,
UNNEST(temperature_readings) AS temp WITH OFFSET AS temp_offset,
UNNEST(power_output_readings) AS power_output WITH OFFSET AS power_offset
WHERE
temp_offset = power_offset
Now the same station record with 24 elements in each array produces exactly 24 rows, with each temperature reading correctly matched to its corresponding power output measurement. The WITH OFFSET clause exposes the zero-based array index, and the WHERE condition ensures elements are only paired when they occupy the same position.
This pattern works because Google Cloud's BigQuery maintains array ordering, making positional correlation reliable. However, this approach requires that your arrays are designed with this relationship in mind. The arrays must have the same length and their elements must be meaningfully related by position.
When Arrays Should Be Processed Independently
Sometimes row explosion is exactly what you need. A mobile game studio might store player data where each record contains an array of achievements unlocked and a separate array of in-game purchases made. These are independent collections that happen to belong to the same player, but there's no positional relationship between them.
If you need to analyze the relationship between achievement unlocking and purchase behavior, you might legitimately want to examine all combinations. However, you need to structure your query carefully to get meaningful results rather than nonsensical aggregations.
A better approach for independent arrays is often to UNNEST them in separate subqueries and then join or union the results:
WITH player_achievements AS (
SELECT
player_id,
achievement,
achievement_date
FROM
game_data.players,
UNNEST(achievements) AS achievement
),
player_purchases AS (
SELECT
player_id,
purchase_item,
purchase_date,
purchase_amount
FROM
game_data.players,
UNNEST(purchases) AS purchase_item
)
SELECT
a.player_id,
COUNT(DISTINCT a.achievement) AS achievement_count,
COUNT(DISTINCT p.purchase_item) AS purchase_count,
SUM(p.purchase_amount) AS total_spent
FROM
player_achievements a
FULL OUTER JOIN
player_purchases p USING(player_id)
GROUP BY
a.player_id
This structure prevents row explosion during the UNNEST operations and gives you control over how the unnested data combines. Each array is flattened independently, and the join happens at the player level where it logically belongs.
Detecting Row Explosion in Your Queries
One challenge with row explosion is that it's not always immediately obvious when it occurs, especially in complex queries with multiple joins and transformations. A freight logistics company analyzing shipment data might have deeply nested structures representing routes, cargo items, and checkpoints. Accidentally creating a Cartesian product between these arrays could produce millions of spurious rows that look plausible until you examine the aggregations closely.
Several warning signs indicate potential row explosion:
- Aggregations that produce unexpectedly large numbers, particularly counts and sums
- Query results where the row count far exceeds the source table row count in ways that don't align with your data model
- Duplicate combinations of values that shouldn't appear together
- Dramatically increased bytes processed compared to similar queries
When debugging a query in Google Cloud's BigQuery console, add row counts at different stages of your query using COUNT(*) in CTEs (Common Table Expressions). This helps you see exactly where row multiplication occurs:
WITH step1 AS (
SELECT *, COUNT(*) OVER() AS row_count_step1
FROM logistics.shipments
),
step2 AS (
SELECT *, COUNT(*) OVER() AS row_count_step2
FROM step1, UNNEST(cargo_items) AS item
),
step3 AS (
SELECT *, COUNT(*) OVER() AS row_count_step3
FROM step2, UNNEST(route_checkpoints) AS checkpoint
)
SELECT
MAX(row_count_step1) AS rows_after_step1,
MAX(row_count_step2) AS rows_after_step2,
MAX(row_count_step3) AS rows_after_step3
FROM step3
If you see the row count jump unexpectedly between steps, you've found your explosion point.
Structural Patterns to Prevent Explosion
Beyond specific query techniques, your BigQuery schema design and data modeling choices significantly impact whether you'll encounter row explosion issues. When designing nested structures for Google Cloud storage, consider whether arrays within the same record represent related or independent collections.
For a telehealth platform tracking patient interactions, you might store video consultation sessions with arrays for both chat messages and vital sign measurements taken during the call. If these measurements are timestamped and meant to be analyzed together, structure your schema to support positional correlation or use a STRUCT array that bundles related data:
CREATE TABLE telehealth.consultations (
consultation_id STRING,
patient_id STRING,
session_data ARRAY>
);
This structure eliminates the possibility of explosion because there's only one array to UNNEST. Each element of the array contains all related data for that timestamp. When you UNNEST this array, you get exactly one row per session data point, with all related fields naturally aligned.
The STRUCT approach works best when data elements are captured together or have a strong logical grouping. For truly independent collections, keeping them as separate arrays is appropriate, but document this design decision so future query authors understand the relationship (or lack thereof) between the arrays.
Performance Implications in GCP
Row explosion doesn't just affect correctness. It has real performance and cost implications in Google Cloud Platform. BigQuery's pricing model charges based on bytes processed, and queries with accidental Cartesian products process far more data than necessary. A query that should scan a few gigabytes might end up processing terabytes of intermediate data.
Beyond direct costs, exploded queries consume more slot time (BigQuery's unit of computational capacity). In organizations using on-demand pricing, this means slower query execution. For those using reserved slots, it means less capacity available for other workloads. A climate research organization processing satellite imagery data with nested arrays for multiple sensor readings could inadvertently consume their entire daily slot allocation with a single poorly structured query.
The GCP BigQuery execution plan provides insights into these issues. Examining query statistics in the BigQuery console shows bytes shuffled and slot time consumed. Queries with row explosion typically show disproportionately high shuffle operations as BigQuery materializes the exploded intermediate results.
Working with Multiple Levels of Nesting
Row explosion becomes even trickier with deeply nested data structures. A podcast network might store episode data with nested arrays: each show has episodes, each episode has chapters, and each chapter has sponsor segments. UNNESTing multiple levels requires careful consideration of what level of granularity your analysis needs.
SELECT
show_name,
episode.title,
chapter.chapter_name,
sponsor.sponsor_name,
sponsor.duration_seconds
FROM
podcast_network.shows,
UNNEST(episodes) AS episode,
UNNEST(episode.chapters) AS chapter,
UNNEST(chapter.sponsors) AS sponsor
This cascading UNNEST is valid when you genuinely need chapter-sponsor level granularity. Each UNNEST operates on a field from the previous level, so there's no Cartesian product between unrelated arrays. However, if you only need show-level sponsor aggregations, performing all these UNNEST operations wastes processing resources. Query at the appropriate level of detail.
Common Scenarios and Solutions
Several patterns appear frequently when working with UNNEST in BigQuery across different industries and use cases:
Time series data with multiple metrics: When storing IoT sensor data from agricultural monitoring systems with arrays of temperature, humidity, and soil moisture readings taken at intervals, use WITH OFFSET to maintain temporal alignment or restructure as STRUCT arrays with timestamps.
Hierarchical relationships: An online learning platform with courses containing modules, modules containing lessons, and lessons containing assessments should UNNEST hierarchically rather than in parallel. Each level's UNNEST references the previous level's unnested field.
Tag or label arrays: When a hospital network stores patient records with separate arrays for diagnoses, medications, and procedures, these are independent collections. UNNEST them in separate CTEs and aggregate at the patient level, not as a Cartesian product.
Event sequences: A payment processor tracking transaction events might have arrays of authorization attempts and settlement events. These are related by transaction but occur independently. Use separate CTEs with appropriate joins rather than multiple UNNEST in the same FROM clause.
Testing and Validation Strategies
When developing queries with UNNEST in Google Cloud BigQuery, build validation into your workflow. Start with a small sample of data and verify row counts at each transformation stage. Use ASSERT statements in scripts to ensure row counts fall within expected ranges:
DECLARE expected_max_rows INT64 DEFAULT 1000000;
DECLARE actual_rows INT64;
SET actual_rows = (
SELECT COUNT(*)
FROM (
-- Your query with UNNEST operations
)
);
ASSERT actual_rows <= expected_max_rows
AS 'Row count exceeds expected maximum, possible explosion';
For production queries, implement monitoring on bytes processed and execution time. Sudden increases in these metrics often indicate a query regression that introduced row explosion. Many organizations running analytics workloads on GCP set up alerting when specific queries exceed baseline resource consumption by significant margins.
Actionable Guidelines
When working with UNNEST operations in BigQuery, apply these principles to avoid exploding rows:
- Identify whether arrays in your data represent parallel collections (positionally related) or independent collections before writing queries
- Use WITH OFFSET and position matching when UNNESTing parallel arrays that should maintain element correspondence
- Process independent arrays in separate CTEs and join results at the appropriate granularity level
- Consider restructuring parallel arrays as STRUCT arrays to bundle related data and prevent explosion by design
- Add intermediate row counts during query development to detect explosion early
- Review bytes processed and slot time metrics for queries involving UNNEST to catch performance issues
- Document array relationships in schema descriptions so other query authors understand the intended data model
Understanding row explosion with UNNEST is valuable for GCP certification preparation, particularly the Professional Data Engineer exam. Scenarios involving nested and repeated fields appear regularly, and questions often test whether candidates recognize when UNNEST operations will produce unexpected results or performance problems.
Building the Right Mental Model
The key to avoiding exploding rows when using UNNEST in BigQuery is shifting from thinking about UNNEST as a simple "array flattening" operation to understanding it as a join operation. Each UNNEST in your FROM clause creates a lateral join between the source row and its array elements. Multiple UNNEST operations create multiple joins, and without proper correlation, these joins form Cartesian products just like joining unrelated tables without a join condition would.
This mental model helps you anticipate when explosion will occur and structure your queries accordingly. You wouldn't join two large tables without a WHERE clause connecting them, and similarly, you shouldn't UNNEST multiple unrelated arrays in the same query scope without considering how they should combine.
Working effectively with nested data in Google Cloud requires practice and attention to your specific data model. The same query patterns that work perfectly for one schema might produce explosions with another. Take time to understand your array structures, test queries on sample data, and validate results against expected row counts. With experience, recognizing and preventing row explosion becomes second nature, letting you leverage BigQuery's powerful support for complex data types without the pitfalls.