External Tables vs Federated Queries in BigQuery
This article breaks down the key differences between external tables and federated queries in BigQuery, helping you understand when each approach makes sense for your data architecture.
When you need to query data that lives outside BigQuery, you have two main options: external tables in BigQuery vs federated queries. Both let you access data from sources like Cloud Storage, Bigtable, or Cloud SQL without loading it into BigQuery's native storage. The choice between them shapes query performance, cost structure, and operational complexity in ways that matter to production workloads.
Understanding this decision helps you design data architectures that balance flexibility with efficiency. The wrong choice can lead to slower queries, higher costs, or unnecessary data duplication. The right choice gives you clean separation between transient and permanent data access patterns.
What External Tables Offer
An external table in BigQuery creates a permanent table definition that points to data stored outside BigQuery's native columnar storage. You define the schema once, and the table appears in your dataset like any other table. The data stays where it is, but you can query it using standard SQL.
The table definition persists in your BigQuery dataset. You can grant permissions on it, reference it in views, and include it in scheduled queries. From a user perspective, querying an external table looks identical to querying a native table.
Consider a climate research institution that stores raw sensor readings from weather stations in Cloud Storage as CSV files. They might create an external table like this:
CREATE OR REPLACE EXTERNAL TABLE climate_data.station_readings
OPTIONS (
format = 'CSV',
uris = ['gs://weather-sensors/readings/*.csv'],
skip_leading_rows = 1
);
Once created, any analyst with access to the climate_data dataset can query this table without needing to know where the files live or how they're formatted. The table becomes part of your data catalog.
When External Tables Make Sense
External tables work well when you have data that changes frequently but needs to be queried regularly. A logistics company tracking package locations might export data from their operational database to Cloud Storage every hour. An external table pointing to these files lets their analytics team query current data without importing it into BigQuery storage.
They also make sense when you want to enforce consistent access patterns. If multiple teams need to query the same Cloud Storage files, an external table provides a single point of schema definition and access control. You avoid the situation where different teams interpret the same files differently or duplicate the federated query logic.
The performance profile is predictable. BigQuery scans the external data each time you query, but it applies predicate pushdown and projection when possible. For Parquet and ORC files, BigQuery can skip entire row groups based on metadata. For Avro and columnar formats, it reads only the columns you select.
Limitations of External Tables
External tables come with performance trade-offs that become apparent under load. Because data lives outside BigQuery's optimized storage, queries typically run slower than equivalent queries on native tables. BigQuery cannot use its full suite of optimizations like clustering, partitioning metadata pruning, or column-oriented compression benefits.
You also lose some query capabilities. External tables do not support DML operations like UPDATE, DELETE, or MERGE. You cannot create materialized views on top of them. Certain query patterns that rely on BigQuery's distributed shuffle architecture may hit limitations.
Cost implications differ from native tables. You pay for the bytes scanned from the external source plus any egress charges if data moves across regions. A video streaming service storing viewer engagement logs in Cloud Storage might find that repeatedly scanning the same files for different queries costs more than loading the data once into native BigQuery storage.
Consider this query on an external table pointing to JSON files:
SELECT
user_id,
COUNT(*) as view_count
FROM streaming_data.viewer_events
WHERE event_date = '2024-01-15'
GROUP BY user_id;
If the JSON files are not partitioned by date at the file path level, BigQuery must scan every file to find matching rows. With native tables, partitioning metadata would let BigQuery skip irrelevant partitions entirely before scanning any data.
How Federated Queries Work
Federated queries let you query external data sources directly within a SQL statement without creating a persistent table definition. You use the EXTERNAL_QUERY function to send a query to a database like Cloud SQL or execute a query against Cloud Storage files inline.
The query is transient. No metadata persists in BigQuery after the query completes. This approach works well for ad hoc analysis or one-off data migrations where you do not need repeated access.
A payment processor might need to join transaction data in BigQuery with customer account details stored in Cloud SQL. They can do this without creating an external table:
SELECT
t.transaction_id,
t.amount,
c.customer_name,
c.risk_score
FROM payments.transactions t
JOIN EXTERNAL_QUERY(
'projects/payment-prod/locations/us-central1/connections/cloudsql-conn',
'''SELECT customer_id, customer_name, risk_score
FROM accounts
WHERE status = "active"'''
) c
ON t.customer_id = c.customer_id
WHERE t.transaction_date = CURRENT_DATE();
The query runs once, joins the data, and returns results. No table definition remains in BigQuery. This keeps your dataset clean when you only need the data momentarily.
Benefits of Federated Queries
Federated queries shine in exploratory scenarios. A data scientist investigating a production issue might need to correlate application logs in BigQuery with database performance metrics in Cloud SQL. Running a federated query gives them immediate answers without requiring schema definitions or access control changes.
They also work well for data migration validation. When moving data from an operational database to BigQuery, you can use federated queries to compare row counts, checksums, or specific field values between the source and destination. Once validation completes, you do not need the connection anymore.
The approach minimizes metadata overhead in your Google Cloud environment. You do not accumulate table definitions for transient data access needs. Your BigQuery dataset stays focused on data you actually manage.
BigQuery's Federation Architecture
BigQuery handles external data access through a connection framework that varies by source type. For Cloud Storage, BigQuery reads files directly using the same high-bandwidth internal network that connects Google Cloud services. For Cloud SQL and other databases, BigQuery establishes connections through the BigQuery Connection API, which manages authentication and query translation.
The architecture impacts performance in specific ways. When you query an external table backed by Cloud Storage, BigQuery distributes the file reading across many workers. Each worker reads portions of the files in parallel. For columnar formats like Parquet, this parallelization works efficiently because workers can independently decode column chunks.
Database federation works differently. When you use EXTERNAL_QUERY against Cloud SQL, BigQuery sends your query to the Cloud SQL instance, which executes it using its own query engine. Cloud SQL returns results to BigQuery, which then processes them further. This means the Cloud SQL instance must have sufficient resources to handle the query load.
BigQuery's treatment of schema differs between external tables and federated queries. With external tables, you define the schema in BigQuery, and BigQuery validates it against the data at query time. With federated queries, the external system owns the schema. If column types change in the source database, your federated query might break without warning.
The service also handles caching differently. Results from external tables follow BigQuery's standard caching behavior. If you run the same query twice within 24 hours and the underlying files have not changed, BigQuery may return cached results. Federated queries to databases bypass this cache because BigQuery cannot track whether the external database data has changed.
Detailed Scenario: Agricultural Sensor Network
Consider an agricultural technology company that monitors soil conditions across thousands of farms. Sensors measure moisture, temperature, and nutrient levels every 15 minutes, generating roughly 400 GB of data daily. The data flows to Cloud Storage as compressed CSV files organized by date and farm region.
The company's data science team runs daily models to predict irrigation needs. Their business intelligence team creates dashboards showing current conditions. Compliance requirements mandate retaining raw sensor data for seven years, but active analysis focuses on the past 90 days.
Initial Implementation with Federated Queries
The team initially uses federated queries for flexibility:
SELECT
sensor_id,
AVG(moisture_level) as avg_moisture,
STDDEV(moisture_level) as moisture_variance
FROM EXTERNAL_QUERY(
'gs://farm-sensors/2024/01/*/*.csv',
FORMAT 'CSV',
SKIP_LEADING_ROWS 1
)
WHERE reading_timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY sensor_id;
This works for initial exploration, but problems emerge. Each analyst writes slightly different query syntax. Some forget to filter recent data and scan months of files unnecessarily. Query costs climb to $1,200 per day because every query scans files from scratch.
Switching to External Tables
The team creates external tables with Hive partitioning:
CREATE OR REPLACE EXTERNAL TABLE agriculture.sensor_readings
(
sensor_id STRING,
farm_id STRING,
reading_timestamp TIMESTAMP,
moisture_level FLOAT64,
temperature_celsius FLOAT64,
nitrogen_ppm INT64
)
WITH PARTITION COLUMNS (
date_partition DATE,
region STRING
)
OPTIONS (
format = 'CSV',
uris = ['gs://farm-sensors/*'],
hive_partition_uri_prefix = 'gs://farm-sensors',
skip_leading_rows = 1
);
Now queries use partition pruning:
SELECT
farm_id,
AVG(moisture_level) as avg_moisture
FROM agriculture.sensor_readings
WHERE date_partition BETWEEN '2024-01-15' AND '2024-01-22'
AND region = 'midwest'
GROUP BY farm_id;
BigQuery scans only files in the specified date range and region, reducing query costs by 75%. The consistent schema definition eliminates analyst confusion. The table appears in the data catalog, making it discoverable through the BigQuery UI.
Cost Analysis
With federated queries, the team averaged 12 TB of data scanned daily across all queries at $5 per TB, totaling $60 daily or $1,800 monthly. After switching to external tables with partitioning, scanned data drops to 3 TB daily because partition pruning eliminates irrelevant files. Monthly query costs fall to $450.
However, external tables add operational overhead. The team must maintain the table definition and update it when schema changes. When they add a new sensor type with additional fields, they need to alter the external table definition and communicate the change to users.
Decision Framework for Your Data Architecture
Choosing between external tables and federated queries depends on access patterns, data volume, and organizational factors. This table summarizes the key trade-offs:
| Factor | External Tables | Federated Queries |
|---|---|---|
| Query frequency | Best for repeated queries on the same data | Best for one-time or rare access |
| Schema management | Centralized in BigQuery, enforced consistently | Defined per query, flexible but inconsistent |
| Access control | Managed through BigQuery IAM | Requires connection-level permissions |
| Performance optimization | Supports partitioning, projection, predicate pushdown | Limited optimization, depends on source system |
| Metadata overhead | Creates persistent table definitions | No persistent metadata |
| Cost predictability | Easier to estimate with partition pruning | Varies by query complexity |
| Data catalog integration | Tables appear in BigQuery catalog | No catalog presence |
| Use case fit | Operational dashboards, recurring reports, team access | Data migration, ad hoc analysis, validation queries |
Use external tables when multiple users or applications need regular access to the same external data. The upfront cost of defining the table pays off through consistent schema interpretation, access control, and query optimization.
Use federated queries when you need temporary access to external data or when the data structure changes too frequently to maintain a table definition. A hospital network validating patient record migrations from legacy systems might run federated queries during the migration phase but switch to native tables once data stabilizes.
Relevance to Google Cloud Professional Data Engineer Certification
The Professional Data Engineer certification exam may test your understanding of when to use external tables versus federated queries. Questions might present scenarios where you need to choose the appropriate data access pattern based on query frequency, performance requirements, or cost constraints.
You might encounter case studies describing data architectures where you need to identify whether external tables or direct data loading would be more appropriate. The exam can include questions about the performance characteristics of external tables with different file formats or the limitations of querying external data sources.
Understanding how BigQuery's connection API works and how to configure external data sources appears in exam domains covering data processing and storage design. You should know which external data sources BigQuery supports, how partitioning works with external tables, and what operations are not available on external tables compared to native ones.
The exam emphasizes practical decision-making. Rather than memorizing that external tables are better or worse, focus on understanding the specific trade-offs that make each approach suitable for different situations. Know when the benefits of persistent metadata and access control justify the operational overhead of maintaining external table definitions.
Making the Right Choice for Your Workload
The decision between external tables in BigQuery vs federated queries comes down to permanence and repeatability. External tables make sense when you need ongoing access to external data with consistent schema interpretation and performance optimization. Federated queries work better for transient access needs where you want to avoid metadata accumulation.
Think about your data's lifecycle. If sensor data, application logs, or exported database snapshots will be queried repeatedly over days or weeks, external tables provide better performance and cost efficiency. If you need to validate a data migration or run a one-time analysis comparing data across systems, federated queries give you the flexibility without permanent metadata.
Consider your team structure. When multiple analysts or applications need access to the same external data, external tables provide consistent schema definitions and centralized access control. When individual data scientists need flexibility to explore different data sources, federated queries let them work independently without requiring schema definitions.
Both approaches have their place in well-designed data architectures on Google Cloud Platform. Knowing when to use each one demonstrates the kind of thoughtful engineering that turns data infrastructure into a competitive advantage rather than just a cost center.