BigQuery Federated Queries: Access Cloud SQL & Spanner
BigQuery federated queries let you analyze data across Cloud SQL, Spanner, and AlloyDB without moving it into BigQuery. This guide explains how they work and when to use them.
Understanding how to efficiently access and analyze data across multiple database systems is a critical skill for the Google Cloud Professional Data Engineer certification exam. One powerful capability that exam takers need to master is BigQuery federated queries, which enable you to query data stored in other Google Cloud databases directly from BigQuery without requiring data replication or movement.
For organizations running operational databases like Cloud SQL for transactional workloads or Cloud Spanner for global distributed applications, the ability to perform analytics on this data without creating complex ETL pipelines offers a significant architectural advantage. BigQuery federated queries solve this challenge by bringing the analytical power of BigQuery to data wherever it lives within the Google Cloud ecosystem.
What Are BigQuery Federated Queries
BigQuery federated queries are a feature that allows you to execute queries against data stored in external Google Cloud databases and retrieve the results as temporary tables within BigQuery. This capability works specifically with three GCP database services: Cloud SQL (supporting MySQL, PostgreSQL, and SQL Server), Cloud Spanner, and AlloyDB for PostgreSQL.
Rather than requiring you to extract data from your operational databases, load it into BigQuery, and then analyze it, federated queries let you run analytics directly on the source data. The query executes on the external database, and BigQuery receives the results for further processing or combination with data already stored in BigQuery tables.
The key distinction from other data access patterns is that federated queries query live data in real time. Each time you run a federated query, it connects to the external database, executes the query, and returns current results. This means you always work with the latest data without managing synchronization processes.
How BigQuery Federated Queries Work
The mechanics of federated queries rely on the EXTERNAL_QUERY() function in BigQuery SQL. This function takes two primary arguments: a connection resource that defines how to reach the external database, and a query string written in the SQL dialect of that external database.
Here's what happens when you execute a federated query. First, BigQuery uses the specified connection to establish a secure link to the external database. Second, BigQuery sends your query to that database for execution. The external database processes the query using its own query engine and returns results to BigQuery. Finally, BigQuery makes these results available as a temporary table that you can join with BigQuery native tables, filter further, or aggregate.
An important technical detail is that you must write the inner query using the SQL dialect of the target database, not BigQuery Standard SQL. If you're querying a Cloud SQL MySQL instance, you write MySQL syntax. If you're querying Cloud Spanner, you use Spanner SQL syntax. This requirement means you need familiarity with the syntax differences across database systems.
Setting up federated queries requires creating a connection resource in BigQuery that contains the authentication and network information needed to reach your external database. For Cloud SQL, this involves specifying the instance connection name and database credentials. For Cloud Spanner, you provide the Spanner instance and database identifiers.
Using Federated Queries with Different Database Services
Querying Cloud SQL
Cloud SQL is commonly used for operational applications that need managed relational databases. A healthcare technology company might run a Cloud SQL PostgreSQL database that stores patient appointment records, with their booking application writing transactions to this database continuously throughout the day.
To analyze appointment patterns without moving this operational data, the analytics team can use federated queries:
SELECT
appointment_date,
department,
COUNT(*) as total_appointments
FROM EXTERNAL_QUERY(
'projects/healthcare-prod/locations/us-central1/connections/clinic-db',
'''SELECT appointment_date, department
FROM appointments
WHERE appointment_date >= CURRENT_DATE - INTERVAL 30 DAY''')
GROUP BY appointment_date, department
ORDER BY appointment_date;
This query retrieves the last 30 days of appointments from Cloud SQL and aggregates them in BigQuery. The inner query uses PostgreSQL syntax, while the outer query uses BigQuery Standard SQL for aggregation.
Querying Cloud Spanner
Cloud Spanner provides globally distributed, strongly consistent databases. A multinational payment processor might use Spanner to store transaction records across regions, ensuring low latency for customers worldwide while maintaining consistency.
When the finance team needs to analyze transaction patterns for fraud detection without impacting operational performance, they can query Spanner from BigQuery:
WITH spanner_transactions AS (
SELECT *
FROM EXTERNAL_QUERY(
'projects/payments-global/locations/us/connections/spanner-prod',
'SELECT transaction_id, merchant_id, amount, country_code, timestamp FROM transactions WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)')
)
SELECT
country_code,
COUNT(*) as transaction_count,
SUM(amount) as total_volume
FROM spanner_transactions
GROUP BY country_code;
This approach allows the analytics team to combine recent Spanner transaction data with historical fraud patterns stored in BigQuery tables for machine learning model scoring.
Querying AlloyDB
AlloyDB for PostgreSQL offers high performance for demanding transactional and analytical workloads. A logistics company managing fleet operations might use AlloyDB to store real-time vehicle telemetry data including location, fuel consumption, and maintenance alerts.
Operations analysts can query this live data to understand current fleet status:
SELECT
v.vehicle_id,
v.current_location,
v.fuel_level,
h.warehouse_name
FROM EXTERNAL_QUERY(
'projects/logistics-ops/locations/us-central1/connections/alloydb-fleet',
'''SELECT vehicle_id, current_location, fuel_level, assigned_warehouse_id
FROM vehicle_status
WHERE last_update > NOW() - INTERVAL ''15 minutes'' ''')
AS v
JOIN `logistics-ops.warehouses.locations` h
ON v.assigned_warehouse_id = h.warehouse_id
WHERE v.fuel_level < 25;
This query combines live vehicle data from AlloyDB with warehouse reference data stored in BigQuery to identify vehicles running low on fuel near their assigned warehouses.
Key Capabilities and Features
BigQuery federated queries provide several important capabilities that make them valuable for specific use cases within Google Cloud architectures.
The most significant capability is real-time data access. Because federated queries execute against the live database, you always receive current data without waiting for batch ETL processes to complete. This is valuable when you need to make decisions based on the latest operational state.
Another key feature is the ability to join external data with BigQuery native tables in a single query. You can combine live operational data from Cloud SQL with historical data warehoused in BigQuery, enabling analysis that spans both real-time and historical contexts. A subscription box service could join current customer subscription status from Cloud SQL with historical purchase behavior in BigQuery to identify customers at risk of canceling.
Federated queries also preserve data in its operational form. You don't need to design and maintain separate schemas optimized for analytics. The source database continues serving transactional workloads while occasional analytical queries run through the federated connection. This reduces the complexity of maintaining multiple copies of data.
The connection resources used by federated queries support authentication and network security. For Cloud SQL, you can use Cloud SQL Auth Proxy for secure connections without managing SSL certificates. For services like Cloud Spanner, the connection uses IAM authentication, ensuring that access controls remain consistent with your broader GCP security model.
Why BigQuery Federated Queries Matter
The business value of federated queries comes from reducing the time and infrastructure required to make operational data available for analysis. Traditional approaches require building ETL pipelines that extract data from operational databases, transform it for analytical purposes, and load it into a data warehouse. This process introduces latency, requires additional infrastructure, and creates opportunities for synchronization issues.
A mobile game studio tracking player behavior might store game state and player actions in Cloud SQL to support fast gameplay. When the analytics team needs to understand player progression patterns to inform game design, federated queries let them access this data immediately without waiting for nightly batch loads. This faster feedback cycle helps the studio iterate more quickly on game features.
Cost efficiency is another important benefit. By querying data in place, you avoid the storage costs of duplicating data into BigQuery and the compute costs of running regular ETL jobs. For datasets that are queried infrequently or where only small subsets are needed for analysis, the cost savings can be substantial.
Federated queries also reduce operational complexity. You don't need to build and maintain data pipelines, manage scheduling for incremental loads, or handle schema evolution across multiple systems. A clinical research organization managing patient data in Cloud SQL can run compliance reports directly against the operational database without creating a separate reporting database that must stay synchronized.
The capability becomes particularly valuable when you need to combine small amounts of fresh operational data with large analytical datasets. A solar energy company might store current inverter performance metrics in AlloyDB for operational monitoring while maintaining years of historical performance data in BigQuery. Federated queries let them join today's readings with historical patterns to detect performance degradation without loading all historical data into the operational database or constantly syncing current data to BigQuery.
When to Use Federated Queries and When Not To
BigQuery federated queries are the right choice for specific scenarios but not appropriate for all data access patterns. Understanding these boundaries is important for the Professional Data Engineer exam and for making good architectural decisions.
Use federated queries when you need occasional access to relatively small amounts of current data from operational databases. If your analytics team runs daily reports that need to include the latest customer orders from Cloud SQL, federated queries provide a simple solution without complex infrastructure.
They work well when you want to join live operational data with large analytical datasets already in BigQuery. A telecommunications company analyzing network performance might join real-time cell tower metrics from Cloud Spanner with historical usage patterns and geographic data stored in BigQuery to identify coverage gaps.
Federated queries are appropriate when the operational database can handle the analytical query load without performance impact. If your Cloud SQL instance has sufficient capacity and your queries are well-optimized, the additional analytical queries won't disrupt transactional workloads.
However, you should not use federated queries as a replacement for proper data warehousing when you need to run frequent, complex analytical queries on large datasets. Each federated query executes on the external database, consuming resources that might be needed for operational workloads. Running hundreds of concurrent analytical queries through federated connections could degrade performance for your application users.
Avoid federated queries when you need consistent point-in-time snapshots for reporting or when query latency is critical. Federated queries involve network round trips to external databases and depend on the query performance of those databases. Complex aggregations or joins on large tables in Cloud SQL will be slower than equivalent queries on properly partitioned BigQuery tables.
They are also not suitable when you need to transform data significantly for analytical purposes. If your operational schema is highly normalized and requires complex joins to answer analytical questions, you're better off building ETL pipelines that denormalize and optimize data for BigQuery.
For high-volume streaming or batch analytics, use BigQuery native tables with proper data loading strategies. A rideshare platform processing millions of trip records per day should load this data into partitioned BigQuery tables rather than trying to query it repeatedly from the operational database.
Implementation Considerations for Google Cloud
Setting up BigQuery federated queries requires creating connection resources that define how BigQuery connects to your external databases. You create these connections using the BigQuery console, the bq command-line tool, or the BigQuery API.
For Cloud SQL, you need to create a connection that specifies the Cloud SQL instance connection string and database name. You also need to provide credentials, either through a dedicated database user or by using Cloud SQL's IAM database authentication. The service account used by BigQuery must have the appropriate permissions to connect to your Cloud SQL instance.
Here's how you create a Cloud SQL connection using the bq command-line tool:
bq mk --connection \
--location=us-central1 \
--project_id=your-project-id \
--connection_type=CLOUD_SQL \
--properties='{"instanceId":"your-project:us-central1:your-instance","database":"your-database","type":"POSTGRES"}' \
--connection_credential='{"username":"analytics-user","password":"your-password"}' \
cloudsql-connection
For Cloud Spanner connections, the setup is simpler because authentication uses IAM. You specify the Spanner instance and database, and BigQuery uses the permissions of the user or service account running the query:
bq mk --connection \
--location=us \
--project_id=your-project-id \
--connection_type=CLOUD_SPANNER \
--properties='{"database":"projects/your-project/instances/your-instance/databases/your-database"}' \
spanner-connection
Network connectivity is another important consideration. Your external database must be accessible from the BigQuery service. For Cloud SQL and AlloyDB, this typically means either using public IP addresses with authorized networks or configuring private service access through VPC peering. Cloud Spanner connections work over Google's internal network.
Query performance depends on several factors. The external database's query performance directly affects federated query speed. Ensure that your queries use appropriate indexes on the external database. The amount of data transferred also matters. Use WHERE clauses in your external query to filter data before it's sent to BigQuery rather than retrieving large datasets and filtering in BigQuery.
Cost considerations include both BigQuery query costs and potential costs from the external database. BigQuery charges for federated queries based on the amount of data processed, similar to regular queries. However, you also need to consider the compute resources consumed on the external database. For Cloud SQL, running resource-intensive analytical queries might require scaling up your instance size.
Quotas and limits apply to federated queries. BigQuery has limits on the number of concurrent queries to external databases and on connection timeouts. Long-running queries on the external database might hit timeout limits, requiring you to optimize the external query or break it into smaller chunks.
Integration with Other Google Cloud Services
Federated queries fit into broader Google Cloud data architectures by connecting BigQuery's analytical capabilities with other data services.
A common pattern combines federated queries with BigQuery's native data loading for a hybrid approach. You might use scheduled queries to periodically load historical data from Cloud SQL into BigQuery tables while using federated queries to access the latest data that hasn't been loaded yet. This gives you optimized performance for historical analysis while maintaining access to current operational data. A fraud detection system could analyze months of historical transactions from BigQuery native tables while checking recent transactions from the last hour via federated queries against Cloud Spanner.
Federated queries work well with Dataflow for streaming data pipelines. A video streaming platform might use Dataflow to process viewing events in real time and write aggregated metrics to Cloud Spanner. Business analysts can then use BigQuery federated queries to combine these near-real-time metrics with historical viewing patterns stored in BigQuery to understand trending content.
Integration with Cloud Functions or Cloud Run enables automated reporting that combines data from multiple sources. A serverless function triggered on a schedule could execute federated queries to pull current inventory levels from Cloud SQL, combine them with sales forecasts in BigQuery, and generate automated purchasing recommendations.
For machine learning workflows with Vertex AI, federated queries let you access training data from operational databases without creating separate data preparation pipelines. A customer churn prediction model could use federated queries to retrieve recent customer interaction data from AlloyDB and combine it with historical features stored in BigQuery for model training.
The integration with Identity and Access Management (IAM) ensures consistent security. Users need appropriate BigQuery permissions to run federated queries and appropriate permissions on the external database. A hospital network could use IAM to ensure that only authorized analysts can query patient data through federated connections, maintaining HIPAA compliance.
Bringing It All Together
BigQuery federated queries provide a powerful mechanism for accessing data across Cloud SQL, Cloud Spanner, and AlloyDB without moving it into BigQuery. By using the EXTERNAL_QUERY() function, you can execute queries on external databases and work with the results directly in BigQuery, joining them with native BigQuery tables or processing them further.
The value proposition centers on reducing complexity and latency for specific use cases where you need occasional access to current operational data. Rather than building and maintaining ETL pipelines for every analytical need, federated queries let you query data where it lives. This approach works best when the data volumes are manageable, query frequency is low to moderate, and the external database can handle the analytical load.
For Google Cloud Professional Data Engineers, understanding when to use federated queries versus other data integration patterns is essential. They represent one tool in a broader toolkit that includes native BigQuery tables, external tables pointing to Cloud Storage, and streaming ingestion through services like Dataflow. Choosing the right approach depends on your specific requirements for data freshness, query performance, cost, and operational complexity.
Whether you're designing a new data architecture on GCP or optimizing an existing one, federated queries offer a pragmatic way to extend BigQuery's analytical capabilities to data stored across Google Cloud's database services. Readers preparing for the Professional Data Engineer certification and looking for comprehensive coverage of BigQuery and other GCP data services can check out the Professional Data Engineer course for in-depth exam preparation.