Modeling Hierarchical Data in BigQuery: A Practical Guide

Understanding how to effectively model hierarchical data in BigQuery using nested and repeated fields can significantly improve query performance and reduce storage costs compared to traditional relational approaches.

Anyone who has worked with relational databases knows the challenge of representing hierarchical data. Whether you're dealing with organizational structures, product categories, order line items, or event streams with nested properties, the traditional approach involves creating multiple tables connected by foreign keys. This works, but it often leads to complex joins that can be expensive to execute and difficult to maintain.

BigQuery offers a different approach to modeling hierarchical data that takes advantage of its columnar storage architecture. Instead of forcing everything into flat tables, BigQuery supports nested and repeated fields that allow you to represent hierarchical relationships within a single table. This capability fundamentally changes how you think about schema design and can deliver significant performance benefits when used appropriately.

Understanding Nested and Repeated Fields

When working with hierarchical data in BigQuery, you have two primary constructs at your disposal: STRUCT types for nesting and ARRAY types for repetition. These can be combined to create complex hierarchical structures that mirror your actual data relationships.

A STRUCT is essentially a container that holds multiple fields of different types. Think of it as an object or record nested within a parent record. For example, an address might be represented as a STRUCT containing street, city, state, and postal code fields. This allows you to group related fields together logically while keeping them in the same table row.

An ARRAY represents a repeated field that can contain multiple values of the same type. This becomes powerful when you combine ARRAY with STRUCT to create repeated nested records. An order table might contain an ARRAY of STRUCT elements, where each STRUCT represents a line item with product ID, quantity, and price.

Here's a practical example showing how you might define a schema for an e-commerce order table:


CREATE TABLE orders (
  order_id STRING,
  order_date TIMESTAMP,
  customer_id STRING,
  shipping_address STRUCT<
    street STRING,
    city STRING,
    state STRING,
    postal_code STRING
  >,
  line_items ARRAY>,
  total_amount FLOAT64
);

This single table captures the entire order hierarchy without requiring separate tables for addresses or line items. Each order row contains all its related data in a denormalized structure.

Querying Hierarchical Data Structures

The real value of modeling hierarchical data in BigQuery becomes apparent when you start querying it. BigQuery provides special syntax for working with nested and repeated fields that allows you to access deeply nested values and expand arrays as needed.

To access fields within a STRUCT, you use dot notation. If you want to filter orders by city, you can reference the nested field directly:


SELECT order_id, order_date, total_amount
FROM orders
WHERE shipping_address.city = 'San Francisco';

Working with repeated fields requires understanding the UNNEST operation. When you have an ARRAY field, UNNEST allows you to expand it into a set of rows that can be joined or filtered. To find all orders containing a specific product, you would unnest the line items array:


SELECT 
  order_id,
  order_date,
  item.product_name,
  item.quantity
FROM orders,
UNNEST(line_items) AS item
WHERE item.product_id = 'PROD-123';

This query expands the line_items array and creates one row for each item, allowing you to filter and analyze at the individual item level. The power here is that BigQuery can perform this operation efficiently because of how it stores columnar data.

Performance Advantages and Trade-offs

The decision to model hierarchical data in BigQuery using nested structures rather than traditional normalized tables has significant performance implications. Because BigQuery stores data in a columnar format, it can read only the specific columns needed for a query. When your hierarchical data lives in a single table, BigQuery can scan just that table without performing expensive joins.

Consider the alternative: if you had separate tables for orders, addresses, and line items, a query that needed information from all three would require joining them together. In a system processing billions of rows, those joins become expensive operations that consume both time and computational resources. With nested data, you eliminate those joins entirely.

The storage efficiency can also be substantial. While it might seem counterintuitive that denormalized data would save space, BigQuery's columnar compression works exceptionally well with repeated values in nested structures. Additionally, you avoid the storage overhead of maintaining foreign key columns and indexes across multiple tables.

However, this approach does have trade-offs. If you frequently need to update individual nested records, the denormalized structure can be less efficient than normalized tables. BigQuery doesn't support updating individual array elements in place, so you would need to update the entire parent row. This makes the nested approach better suited for append-heavy workloads or scenarios where you batch updates periodically.

When to Use Nested Structures

The nested and repeated field approach works best for specific types of hierarchical data patterns. Event data with associated properties is an ideal use case. When you're ingesting clickstream data, application logs, or IoT sensor readings, each event often has multiple associated attributes. Storing these as nested structures keeps related data together and simplifies analysis.

One-to-many relationships where the child records have no independent meaning also benefit from this approach. Order line items only make sense in the context of their parent order. Product variants or image assets for a product listing similarly exist as attributes of the parent entity. These are natural candidates for nested structures.

Snapshot data that captures the state of a system at a point in time often works well with nesting. If you're taking daily snapshots of inventory across locations, each snapshot record might contain an array of location-specific details. Since you're not updating historical snapshots, the update limitations of nested data don't apply.

On the other hand, if you have many-to-many relationships or entities that have independent lifecycles and are referenced from multiple places, traditional normalized tables might still be the better choice. User accounts that are referenced from orders, reviews, and support tickets probably shouldn't be nested within each of those contexts.

Working with Deeply Nested Data

BigQuery supports nesting up to 15 levels deep, which allows for quite complex hierarchical structures. You might have an array of structs where each struct contains another array of structs. While this flexibility is powerful, deeper nesting can make queries more complex and harder to understand.

A common pattern is representing tree structures such as organizational hierarchies or category taxonomies. You might have an employees table where each employee record contains an array of direct reports:


CREATE TABLE employees (
  employee_id STRING,
  name STRING,
  title STRING,
  department STRING,
  direct_reports ARRAY>
);

This captures one level of hierarchy. For deeper trees, you might need recursive queries or multiple levels of nesting. The practical limit often comes from query complexity rather than technical constraints. A three or four level nesting is usually manageable, but beyond that, you should carefully consider whether the schema is serving your query patterns well.

Loading and Transforming Hierarchical Data

Getting hierarchical data into BigQuery with the proper nested structure requires attention to your data pipeline. If your source data is already in JSON format with nested objects and arrays, BigQuery can infer or use a provided schema to load it directly. The bq command line tool and client libraries support schema specification that includes nested and repeated fields.

When loading JSON data, you can let BigQuery auto-detect the schema or provide an explicit schema definition. For production pipelines, explicit schemas are generally preferable because they give you control over field types and ensure consistency:


bq load --source_format=NEWLINE_DELIMITED_JSON \
  --schema=schema.json \
  project:dataset.table \
  gs://bucket/data.json

If your source data comes from relational databases or flat files, you'll need to transform it during the load process. You can use Cloud Dataflow or BigQuery itself to perform these transformations. A common pattern is to load the flat data into staging tables, then use SQL with ARRAY_AGG and STRUCT constructors to build the nested structures:


CREATE OR REPLACE TABLE orders_nested AS
SELECT 
  o.order_id,
  o.order_date,
  o.customer_id,
  STRUCT(
    a.street,
    a.city,
    a.state,
    a.postal_code
  ) AS shipping_address,
  ARRAY_AGG(STRUCT(
    li.product_id,
    li.product_name,
    li.quantity,
    li.unit_price
  )) AS line_items,
  SUM(li.quantity * li.unit_price) AS total_amount
FROM orders_flat o
JOIN addresses_flat a ON o.address_id = a.address_id
JOIN line_items_flat li ON o.order_id = li.order_id
GROUP BY 1, 2, 3, 4;

This query takes normalized data from three tables and combines them into a single nested structure. The ARRAY_AGG function collects all line items for each order into an array, while STRUCT creates the nested address object.

Schema Evolution Considerations

One practical concern with hierarchical data in BigQuery is managing schema changes over time. BigQuery supports adding new fields to existing tables, including adding fields to nested structs. However, you cannot remove fields or change field types in ways that aren't compatible.

When you need to add a new field to a nested struct, you can alter the table schema. The new field will be NULL for existing rows. This works well for additive changes but means you need to think ahead about your schema design. Planning for optional fields from the start gives you flexibility as requirements evolve.

For major schema changes, you might need to create a new table with the updated schema and migrate data. This is where having your transformation logic defined in SQL or Dataflow pipelines pays off, as you can rerun the transformations with the new schema definition.

Monitoring and Query Optimization

Understanding how BigQuery processes queries against hierarchical data helps you optimize performance. The EXPLAIN statement shows you the query execution plan, including how BigQuery accesses nested fields and performs unnest operations. Looking at bytes processed and slot time consumed helps you identify expensive operations.

When querying nested data, selecting only the fields you need becomes even more important. If you have a large struct with many fields but only need one or two, explicitly selecting just those fields reduces the data BigQuery must read. The columnar storage means BigQuery can skip reading struct fields you don't reference.

Partitioning and clustering work with nested tables just as they do with flat tables. You can partition on a top-level date field and cluster on frequently filtered columns, including fields within structs. This helps BigQuery prune partitions and optimize data scanning even with complex nested structures.

Certification and Further Learning

Understanding how to model hierarchical data in BigQuery is covered in the Professional Data Engineer certification exam. The exam tests your ability to choose appropriate schema designs for different use cases and understand the trade-offs between normalized and denormalized approaches. You'll encounter scenarios where you need to decide whether nested structures or separate tables make more sense for a given workload.

The topic also appears in the Professional Cloud Architect certification in the context of designing data processing systems. Knowing when to use BigQuery's native hierarchical capabilities versus other approaches affects architectural decisions around data pipelines and analytics platforms.

Practical Takeaways

Modeling hierarchical data in BigQuery using nested and repeated fields offers significant advantages for the right use cases. When you have naturally hierarchical data with one-to-many relationships and your queries benefit from having related data colocated, nested structures can dramatically simplify your schema and improve query performance.

The approach works particularly well for event data, time-series data with attributes, and scenarios where you're primarily appending data rather than updating individual records. The elimination of joins and the efficient columnar storage make this pattern powerful for analytics workloads processing large volumes of data.

At the same time, this isn't a universal solution. Traditional normalized schemas still have their place, especially when you need to update data frequently or when you're working with many-to-many relationships. The key is understanding your access patterns and choosing the modeling approach that best serves your specific requirements. BigQuery gives you the flexibility to use both approaches within the same system, allowing you to optimize each dataset for its particular use case.