Repeated Fields vs Separate Child Tables in BigQuery

A detailed comparison of BigQuery's repeated fields and traditional normalized tables, helping you choose the right data modeling approach for performance and cost.

When designing data models in BigQuery, one of the most important decisions you'll face is choosing between repeated fields and separate child tables. This choice fundamentally affects query performance, storage costs, and how you write SQL. Unlike traditional relational databases that push you toward normalization, BigQuery's columnar architecture opens up a different path: nested and repeated fields that can store arrays directly within a parent row.

This decision matters because getting it wrong can lead to slow queries, inflated costs, and complex joins that scale poorly. Understanding when to denormalize with repeated fields versus when to stick with separate child tables will make you a better data engineer and help you design efficient, cost-effective data warehouses.

Understanding Repeated Fields in BigQuery

A repeated field in BigQuery is essentially an array column that can hold multiple values of the same type within a single row. Think of it as storing one-to-many relationships directly inside the parent record rather than splitting them into a separate table.

Consider an e-commerce scenario where each order contains multiple line items. With repeated fields, your schema might look like this:


CREATE TABLE orders (
  order_id STRING,
  customer_id STRING,
  order_date TIMESTAMP,
  line_items ARRAY<STRUCT<product_id STRING, quantity INT64, unit_price FLOAT64>>
);

In this model, all line items live directly within the order row. When you query this table, you use the UNNEST function to expand the array when needed:


SELECT 
  order_id,
  item.product_id,
  item.quantity * item.unit_price AS line_total
FROM orders,
UNNEST(line_items) AS item
WHERE order_date >= '2024-01-01';

The power of repeated fields lies in data locality. All related information is stored together in the same physical blocks on disk. When BigQuery reads an order, it gets all its line items in one scan without needing to perform joins. This reduces the amount of data shuffled across the network and can dramatically speed up queries that need the complete picture of a parent and its children.

Repeated fields work best when the child data is always accessed together with the parent. Order line items, product attributes, and event properties are common examples. These relationships are tightly bound, and you rarely need to query the child records independently.

Drawbacks of Repeated Fields

Despite their advantages, repeated fields introduce several limitations that can become problematic as your data and query patterns evolve.

First, updating individual array elements is inefficient and expensive. BigQuery stores data in a columnar format optimized for reading, not updating. If you need to modify or remove a single line item within an order, you must rebuild the entire array and rewrite the row:


UPDATE orders
SET line_items = ARRAY(
  SELECT AS STRUCT * 
  FROM UNNEST(line_items) 
  WHERE product_id != 'PROD123'
)
WHERE order_id = 'ORD456';

This operation is both slow and costly because BigQuery charges for the amount of data processed. You end up paying to read and rewrite the entire order row just to modify one array element.

Second, repeated fields become unwieldy when the child data grows large or varies significantly in size. If some orders have 3 line items and others have 300, you create a skewed data distribution that can hurt query performance. BigQuery also caps row size (10 MB per row for legacy streaming inserts, 100 MB for batch loads), and hitting that ceiling forces you to redesign your schema.
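You can spot this kind of skew before it becomes a problem by profiling array lengths. A minimal diagnostic query, assuming the orders schema shown earlier:

SELECT
  ARRAY_LENGTH(line_items) AS item_count,
  COUNT(*) AS order_count
FROM orders
GROUP BY item_count
ORDER BY item_count DESC;

If the largest item counts are orders of magnitude above the median, that is a signal the repeated-field model may not hold up.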

Third, querying child data independently becomes awkward. If you want to analyze all products sold across all orders without caring about which specific orders they came from, you must always unnest from the parent table. This creates unnecessary complexity and can scan more data than needed.
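For example, a product-level analysis that has nothing to do with individual orders still has to route through the parent table and unnest on every run (a sketch against the nested schema above):

SELECT
  item.product_id,
  SUM(item.quantity) AS units_sold
FROM orders,
UNNEST(line_items) AS item
GROUP BY item.product_id;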

Separate Child Tables: The Traditional Approach

The alternative to repeated fields is the normalized relational model you learned in database design courses. You create separate tables for parents and children, linked by foreign keys:


CREATE TABLE orders (
  order_id STRING,
  customer_id STRING,
  order_date TIMESTAMP
);

CREATE TABLE order_line_items (
  order_id STRING,
  line_item_id STRING,
  product_id STRING,
  quantity INT64,
  unit_price FLOAT64
);

With this structure, you join the tables when you need combined information:


SELECT 
  o.order_id,
  li.product_id,
  li.quantity * li.unit_price AS line_total
FROM orders o
JOIN order_line_items li ON o.order_id = li.order_id
WHERE o.order_date >= '2024-01-01';

Separate child tables excel at flexibility. Each line item is its own row, making updates straightforward and cheap. You can modify, delete, or insert individual line items without touching any other data. This granular control is crucial for transactional workloads or datasets that change frequently.

Child tables also make independent analysis natural. Want to find the top-selling products across all time? Query the line items table directly without involving orders. Need to enforce referential integrity or add indexes? Traditional tables give you those tools.
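With the normalized schema, that same product analysis touches only the child table and only the columns it needs (a sketch against the order_line_items table above):

SELECT
  product_id,
  SUM(quantity) AS units_sold
FROM order_line_items
GROUP BY product_id
ORDER BY units_sold DESC
LIMIT 10;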

The normalized approach scales better when child records vary dramatically in number. Your table structure remains consistent whether an order has one item or a thousand. You avoid row size concerns and distribute storage more evenly.

How BigQuery's Architecture Affects This Trade-Off

BigQuery's columnar storage and distributed execution engine change the calculus compared to traditional row-based databases. In systems like PostgreSQL or MySQL, joins are expensive operations that require index lookups and random disk access. Denormalization was often recommended to avoid join overhead.

BigQuery processes joins differently. It distributes both tables across hundreds or thousands of workers and performs parallel hash joins or broadcast joins depending on table sizes. When tables are properly partitioned and clustered, joins can be remarkably efficient. This means the performance penalty for normalized schemas is much smaller than in traditional databases.

However, BigQuery's columnar format gives repeated fields a unique advantage. When you store nested data, BigQuery uses a technique called record shredding. It still stores the data in columns, but keeps parent and child records co-located. This means reading an order with its line items requires fewer disk seeks and less network transfer than joining two separate tables.

The billing model also matters. BigQuery charges based on bytes scanned, not compute time. A query that joins two large tables scans data from both. If you only need a few columns from each, the columnar format helps, but you still pay for all rows that match the join condition. With repeated fields, you scan fewer total bytes because there's no join overhead and no duplicate parent data.

BigQuery's limitations around updates and mutations push you toward batch-oriented workflows. If your data arrives in bulk loads or scheduled pipelines rather than continuous updates, repeated fields become more attractive. The update penalty matters less when you rebuild partitions wholesale rather than modifying individual records.
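In a batch-oriented pipeline, the typical pattern is to replace a whole day's data rather than update individual rows. A hedged sketch, assuming a hypothetical orders_staging table holding the day's fresh load:

DELETE FROM orders
WHERE DATE(order_date) = '2024-06-01';

INSERT INTO orders
SELECT * FROM orders_staging
WHERE DATE(order_date) = '2024-06-01';

Rebuilding a partition wholesale like this sidesteps the per-row update penalty that makes repeated fields expensive to mutate.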

A Detailed E-Commerce Scenario

Let's work through a concrete example to see how these choices play out. Imagine you're building analytics for an online retailer that processes 100,000 orders per day. Each order averages 5 line items. You need to support several query patterns:

  • Daily sales reports aggregating revenue by product
  • Customer lifetime value calculations joining orders to customers
  • Real-time dashboards showing recent order activity
  • Product recommendation models analyzing purchase patterns

With repeated fields, your orders table contains roughly 100,000 rows per day. Storage is efficient because you don't duplicate order metadata across line items. A query for daily revenue looks like:


SELECT 
  DATE(order_date) AS sale_date,
  item.product_id,
  SUM(item.quantity * item.unit_price) AS revenue
FROM orders,
UNNEST(line_items) AS item
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2;

This query scans the entire orders table but processes it efficiently because all data is co-located. If your orders table is partitioned by order_date and you typically query recent data, BigQuery prunes old partitions automatically.
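The partitioned version of the nested table might be declared like this (a sketch; the clustering column is an assumption, not part of the scenario):

CREATE TABLE orders (
  order_id STRING,
  customer_id STRING,
  order_date TIMESTAMP,
  line_items ARRAY<STRUCT<product_id STRING, quantity INT64, unit_price FLOAT64>>
)
PARTITION BY DATE(order_date)
CLUSTER BY customer_id;

With this layout, a filter on order_date prunes whole partitions before any bytes are scanned.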

With separate tables, you have 100,000 rows in orders and 500,000 rows in order_line_items per day. The same revenue query becomes:


SELECT 
  DATE(o.order_date) AS sale_date,
  li.product_id,
  SUM(li.quantity * li.unit_price) AS revenue
FROM orders o
JOIN order_line_items li ON o.order_id = li.order_id
WHERE o.order_date >= '2024-01-01'
GROUP BY 1, 2;

This query scans both tables. Even though you only need the date from orders, BigQuery must scan and join on order_id. If both tables are partitioned on order_date, performance remains good, but you process more total bytes.

Now consider a product recommendation model that needs to analyze which products are frequently purchased together. With repeated fields:


SELECT 
  a.product_id AS product_a,
  b.product_id AS product_b,
  COUNT(*) AS co_occurrence
FROM orders,
UNNEST(line_items) AS a,
UNNEST(line_items) AS b
WHERE a.product_id < b.product_id
GROUP BY 1, 2
ORDER BY co_occurrence DESC
LIMIT 100;

This self-join on the array is elegant and fast because BigQuery keeps all line items together. With separate tables, you need:


SELECT 
  a.product_id AS product_a,
  b.product_id AS product_b,
  COUNT(*) AS co_occurrence
FROM order_line_items a
JOIN order_line_items b ON a.order_id = b.order_id
WHERE a.product_id < b.product_id
GROUP BY 1, 2
ORDER BY co_occurrence DESC
LIMIT 100;

This query is more expensive because BigQuery must shuffle and join the line items table with itself. Even with clustering, you scan more data and perform more network operations.

However, if you need to update line items because prices changed or customers modified orders, the separate table approach wins. Updating a single line item costs a fraction of what rewriting entire order rows would cost.
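With separate tables, the same correction touches only the affected child rows (the identifiers and price here are hypothetical):

UPDATE order_line_items
SET unit_price = 19.99
WHERE order_id = 'ORD456'
  AND product_id = 'PROD123';

Compare this with the repeated-field version shown earlier, which must rebuild the entire array just to change one element.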

Decision Framework for Choosing Your Approach

When deciding between repeated fields and separate child tables in BigQuery, consider these factors:

Factor             | Use Repeated Fields                           | Use Separate Tables
Query pattern      | Parent and children always accessed together  | Children queried independently often
Update frequency   | Rare updates, mostly append-only              | Frequent updates to individual child records
Child record count | Small to moderate, consistent across parents  | Highly variable or very large per parent
Data arrival       | Batch loads with complete records             | Streaming or incremental updates
Join complexity    | Simple parent-child relationships             | Multiple joins across many tables
Row size           | Well under the per-row size limit             | Risk of exceeding row size limits

Use repeated fields when your data model represents tight aggregations where the child data is essentially attributes of the parent. Event logs with properties, orders with line items, and products with multiple images are good candidates.

Use separate tables when children have independent identity and lifecycle. If you need to track history on child records, enforce foreign key relationships in your application logic, or query children across many parents efficiently, normalization makes more sense.

Relevance to Google Cloud Certification Exams

This topic can appear on the Professional Data Engineer certification and occasionally on the Professional Cloud Architect exam. Google emphasizes understanding BigQuery's unique capabilities, including nested and repeated fields, as part of effective data warehouse design.

You might encounter a scenario like this: A retail company wants to optimize their BigQuery schema for analyzing customer orders. They currently use separate tables for orders and line items, but query costs are high because most reports need both tables joined together. Order data is loaded once daily and rarely updated. Line items average 3 to 8 per order. What schema change would reduce costs?

The correct answer would be to use repeated fields to store line items within the orders table. The scenario contains key indicators: data is batch-loaded, updates are rare, parent and children are always queried together, and child count is moderate. Moving to repeated fields eliminates join overhead and reduces bytes scanned.

If the scenario mentioned frequent updates to line items or the need to analyze line items independently across millions of orders, separate tables would be the better choice. Exam questions test your ability to weigh these trade-offs based on specific requirements.

Making the Right Choice for Your Data Model

The decision between repeated fields and separate child tables in BigQuery is not about finding a universally correct answer. Both approaches have legitimate uses, and understanding the trade-offs allows you to choose intelligently based on your specific workload.

Repeated fields leverage BigQuery's columnar architecture to deliver better performance and lower costs when parent and child data belong together. They simplify queries, reduce data scanned, and take advantage of co-location. Use them when your data model reflects true aggregation and you work in batch-oriented patterns.

Separate child tables provide flexibility, granular control, and clear separation of concerns. They handle variable-sized children better, make updates cheap, and support independent analysis. Choose them when children have independent identity or your workload includes frequent mutations.

The best data engineers don't dogmatically follow one approach. They analyze query patterns, understand cost implications, and make conscious trade-offs. As you design BigQuery schemas, ask yourself how the data will be queried, how often it changes, and whether the relationship truly represents parts of a whole or independent entities that happen to be related. That analysis will guide you to the right choice.