GCP Batch and Streaming Services: A Complete Guide

Understand which Google Cloud services handle batch processing, streaming data, or both. Learn when to use Dataflow, BigQuery, Pub/Sub, Dataproc, and Composer for your data pipelines.

Data engineers working with Google Cloud Platform face a fundamental architectural decision: should their pipelines process data in real-time streams or scheduled batches? The answer often determines which GCP services make sense for a given project. For professionals preparing for the Professional Data Engineer certification exam, understanding the distinctions between GCP batch and streaming services is essential because this choice affects pipeline architecture, cost structure, and data freshness requirements.

The challenge is that some workloads clearly fit one model while others benefit from a hybrid approach. A fraud detection system for a payment processor needs to analyze transactions within milliseconds of occurrence. A monthly financial reporting pipeline for a hospital network can process billing data in overnight batch jobs. Many scenarios fall somewhere in between, requiring services that handle both models effectively.

Understanding Batch and Stream Processing Models

Batch processing involves collecting data over a period of time, then processing the entire dataset together during scheduled windows. A furniture retailer might aggregate all daily sales transactions and run analytics overnight to update inventory forecasts and generate executive dashboards by morning.

Stream processing handles data records individually or in small groups as they arrive, enabling near real-time responses. A smart building sensor network continuously monitors temperature, occupancy, and energy usage, triggering immediate alerts when thresholds are exceeded and adjusting HVAC systems dynamically.

The distinction matters because these models have different performance characteristics, cost implications, and complexity requirements. Batch processing typically offers lower per-record costs and simpler implementation but introduces latency between data generation and insights. Stream processing delivers immediate results but requires more sophisticated pipeline architecture and often costs more per record processed.

Stream-Only Services in Google Cloud

Cloud Pub/Sub: Foundation for Real-Time Ingestion

Cloud Pub/Sub serves as the primary streaming data ingestion service within GCP. It functions as a globally distributed messaging service that decouples data producers from consumers, allowing systems to publish messages without knowing which services will process them.

A mobile game studio might use Pub/Sub to capture player events as they happen. When a player completes a level, makes an in-game purchase, or invites a friend, the game client publishes these events to Pub/Sub topics. Multiple downstream systems can subscribe to these topics: one subscription feeds a Dataflow pipeline for real-time player segmentation, another sends data to BigQuery for analytics, and a third triggers Cloud Functions for personalized push notifications.
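
The publish side of this pattern is only a few lines of client code. Below is a minimal sketch using the google-cloud-pubsub Python library; the project ID, topic name, and event fields are illustrative assumptions rather than details from a real game backend.

import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical topic receiving player events
topic_path = publisher.topic_path('my-project', 'player-events')

event = {'player_id': 'p-1042', 'event_type': 'level_complete', 'level': 7}

# publish() returns a future; result() blocks until Pub/Sub assigns a message ID
future = publisher.publish(topic_path, json.dumps(event).encode('utf-8'))
print(future.result())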

The service guarantees at-least-once delivery: every message will reach subscribers, but a message may occasionally arrive more than once. Message order is also not guaranteed by default (ordering keys can enforce per-key ordering), so subscribers should handle duplicates and out-of-order messages idempotently. This design prioritizes reliability and scale, and it works well for the vast majority of streaming use cases. Pub/Sub automatically scales to handle traffic spikes without capacity planning, making it suitable for unpredictable workloads.
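
On the consuming side, that means acknowledging a message only after it has been processed and deduplicating where repeats would cause harm. Here is a minimal pull-subscriber sketch with the same library; the subscription name is hypothetical.

from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical subscription feeding the analytics pipeline
subscription_path = subscriber.subscription_path('my-project', 'player-events-analytics')

def callback(message):
    # Deduplicate downstream using message_id or an event ID in the payload,
    # because the same message can be delivered more than once
    print(f"Received {message.message_id}: {message.data}")
    message.ack()  # acknowledge only after processing succeeds

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull_future.result(timeout=60)  # listen for one minute in this sketch
except TimeoutError:
    streaming_pull_future.cancel()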

While it's technically possible to use Pub/Sub in batch-like scenarios by letting messages accumulate over time, the service is optimized for continuous data flow. Unacknowledged messages are retained for seven days by default, which provides a buffer for subscriber recovery but isn't designed for long-term storage.

Batch-Focused Services in GCP

Cloud Dataproc: Managed Hadoop and Spark

Cloud Dataproc provides managed clusters running Apache Hadoop and Apache Spark, making it the natural choice when existing batch processing code uses these frameworks. A genomics research lab processing DNA sequencing data might have years of investment in Spark-based pipelines. Dataproc allows them to run these workloads on Google Cloud infrastructure without rewriting code.

The service excels at processing large datasets in scheduled batches. A freight logistics company might spin up a Dataproc cluster each night to process the day's shipment tracking data, calculate delivery performance metrics, and optimize route planning for the next day. When processing completes, the cluster shuts down to avoid unnecessary costs.

Dataproc clusters can be created in under 90 seconds, which enables an ephemeral cluster pattern where you create a cluster for a specific job, process the data, then delete the cluster. This approach minimizes cost compared to long-running clusters. The service integrates with Cloud Storage for input and output data, allowing data persistence independent of cluster lifecycle.


# Create a Dataproc cluster for batch processing
gcloud dataproc clusters create analytics-cluster \
  --region=us-central1 \
  --num-workers=4 \
  --worker-machine-type=n1-standard-4 \
  --max-idle=30m

# Submit a Spark job
gcloud dataproc jobs submit spark \
  --cluster=analytics-cluster \
  --region=us-central1 \
  --jar=gs://my-bucket/analytics-job.jar

Cloud Composer: Workflow Orchestration

Cloud Composer manages complex batch workflows with dependencies between tasks. Built on Apache Airflow, it schedules and monitors multi-step data pipelines where each step must complete successfully before the next begins.

An online learning platform might use Composer to orchestrate a daily data pipeline that first extracts student activity logs from Cloud Storage, then triggers a Dataproc job to calculate engagement metrics, loads results into BigQuery, generates summary tables, and finally sends completion notifications. If any step fails, Composer can retry tasks, send alerts, or trigger compensation logic.

The service provides a web interface showing pipeline execution history, task status, and dependencies. Data engineers can define workflows as Python code using Directed Acyclic Graphs (DAGs), which specify task relationships and execution conditions.


from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-engineering',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

with DAG('daily_batch_pipeline', 
         default_args=default_args,
         schedule_interval='0 2 * * *') as dag:
    
    process_logs = DataprocSubmitJobOperator(
        task_id='process_activity_logs',
        project_id='my-project',
        region='us-central1',
        # A job spec needs a target cluster and a job type in addition to the reference
        job={
            'reference': {'project_id': 'my-project'},
            'placement': {'cluster_name': 'analytics-cluster'},
            'spark_job': {'main_jar_file_uri': 'gs://my-bucket/analytics-job.jar'}
        }
    )

    load_results = BigQueryInsertJobOperator(
        task_id='load_to_warehouse',
        configuration={'query': {'query': 'INSERT INTO...',
                                 'useLegacySql': False}}
    )
    
    process_logs >> load_results

While Composer primarily targets batch workflows, it can orchestrate streaming pipelines by managing their lifecycle rather than processing streaming data directly.
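
For example, a DAG task can launch a streaming Dataflow job and leave the continuous processing to Dataflow. The sketch below uses the Google provider's Dataflow template operator; the template path and its parameters are assumptions for illustration, not a verified configuration.

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator
from datetime import datetime

with DAG('start_streaming_pipeline',
         start_date=datetime(2024, 1, 1),
         schedule_interval=None) as dag:

    # Launch a streaming Dataflow job from a template; Composer manages the
    # job lifecycle while Dataflow performs the stream processing
    start_job = DataflowTemplatedJobStartOperator(
        task_id='start_transactions_job',
        template='gs://dataflow-templates/latest/PubSub_to_BigQuery',  # hypothetical template path
        location='us-central1',
        parameters={
            'inputTopic': 'projects/my-project/topics/transactions',
            'outputTableSpec': 'my-project:dataset.results',
        },
    )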

Hybrid Services: Handling Both Batch and Stream Processing

Cloud Dataflow: Unified Stream and Batch Processing

Cloud Dataflow represents Google Cloud's most versatile data processing service because it handles both streaming and batch workloads using the same programming model. The service executes pipelines built with Apache Beam, an open source unified framework whose name combines "batch" and "stream."

This unification matters because it allows data engineers to write pipeline logic once and execute it in either mode. A subscription box service might build a customer segmentation pipeline in Dataflow. During initial development, they run it in batch mode against historical order data stored in Cloud Storage. Once validated, they switch the same pipeline to streaming mode, reading new orders from Pub/Sub as they arrive and updating customer segments in real time.

The service automatically handles complexities like windowing (grouping stream data by time periods), watermarking (determining when data for a time window is complete), and exactly-once processing semantics. A solar farm monitoring system might use tumbling windows to aggregate sensor readings every five minutes, calculating average power output while handling late-arriving data from sensors with intermittent connectivity.

Dataflow scales horizontally by adding or removing worker instances based on data volume and processing requirements. Unlike Dataproc, you don't manually configure cluster size. The service makes scaling decisions automatically, which simplifies operations but reduces fine-grained cost control.


import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Same pipeline logic works for batch or streaming
class ParseTransaction(beam.DoFn):
    def process(self, element):
        # Pub/Sub delivers bytes; text files from Cloud Storage deliver strings
        if isinstance(element, bytes):
            element = element.decode('utf-8')
        record = json.loads(element)
        # Emit (key, value) pairs so CombinePerKey can sum amounts per user
        yield (record['user'], float(record['amount']))

def to_row(kv):
    # Convert (user_id, total) pairs into BigQuery rows
    return {'user_id': kv[0], 'total_amount': kv[1]}

# Batch execution reading from Cloud Storage
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'ReadBatch' >> beam.io.ReadFromText('gs://bucket/transactions/*.json')
     | 'Parse' >> beam.ParDo(ParseTransaction())
     | 'Aggregate' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(to_row)
     | 'Write' >> beam.io.WriteToBigQuery(
         'project:dataset.results',
         schema='user_id:STRING,total_amount:FLOAT'))

# Streaming execution reading from Pub/Sub with one-minute fixed windows
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | 'ReadStream' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/transactions')
     | 'Parse' >> beam.ParDo(ParseTransaction())
     | 'Window' >> beam.WindowInto(beam.window.FixedWindows(60))
     | 'Aggregate' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(to_row)
     | 'Write' >> beam.io.WriteToBigQuery(
         'project:dataset.results',
         schema='user_id:STRING,total_amount:FLOAT'))
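
The solar farm scenario described earlier maps onto Beam's windowing controls. The fragment below is a separate sketch, not part of the transaction pipeline, using hypothetical (sensor_id, kilowatts) readings to show five-minute tumbling windows that re-fire once per late record and tolerate data arriving up to ten minutes late.

import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    (p
     # Hypothetical (sensor_id, kilowatts) readings with event-time timestamps
     | 'Create' >> beam.Create([('panel-1', 4.2), ('panel-2', 3.9)])
     | 'Timestamp' >> beam.Map(lambda kv: window.TimestampedValue(kv, 1700000000))
     | 'Window' >> beam.WindowInto(
         window.FixedWindows(5 * 60),
         trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
         accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
         allowed_lateness=10 * 60)
     | 'AveragePower' >> beam.combiners.Mean.PerKey()
     | 'Print' >> beam.Map(print))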

BigQuery: Analytics for Batch and Streaming Data

BigQuery functions as a serverless data warehouse that supports both traditional batch queries and streaming inserts, making it suitable for diverse analytical workloads. The service separates storage from compute, allowing you to query petabyte-scale datasets without managing infrastructure.

For batch analytics, a hospital network might run daily queries aggregating patient admission records, calculating average length of stay, and identifying capacity trends. These queries scan large historical datasets, and BigQuery's columnar storage and distributed execution model delivers results in seconds even against years of data.
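
Batch analytics like this are ordinary SQL jobs submitted through the console, the bq tool, or a client library. Here is a minimal sketch with the Python client; the dataset, table, and column names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and columns for a daily admissions report
query = """
    SELECT department, AVG(length_of_stay_days) AS avg_stay
    FROM `my-project.hospital.admissions`
    WHERE admission_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    GROUP BY department
"""

# result() waits for the query job to finish and returns an iterator of rows
for row in client.query(query).result():
    print(row.department, row.avg_stay)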

The streaming insert capability allows near real-time data availability. A telehealth platform can stream patient vital signs from monitoring devices directly into BigQuery tables using the streaming API. Medical staff can query this data almost immediately, building dashboards that show current patient status across facilities.
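
The sketch below shows a streaming insert with the Python client's insert_rows_json method (the legacy insertAll-based path); the table and field names are hypothetical, and newer high-throughput pipelines may prefer the Storage Write API.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table receiving vital-sign readings
table_id = 'my-project.telehealth.vitals'

rows = [
    {'patient_id': 'pt-221', 'heart_rate': 72, 'recorded_at': '2024-01-15T10:32:00Z'},
    {'patient_id': 'pt-221', 'heart_rate': 75, 'recorded_at': '2024-01-15T10:33:00Z'},
]

# insert_rows_json streams rows and returns a list of per-row errors (empty on success)
errors = client.insert_rows_json(table_id, rows)
if errors:
    print(f"Streaming insert errors: {errors}")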

The service charges separately for storage, batch query processing, and streaming inserts. Batch queries are billed per terabyte scanned, which incentivizes partitioning and clustering strategies that minimize data scanned. Streaming inserts cost more per row than batch loading but enable immediate data availability.

BigQuery ML extends the platform by allowing machine learning model training and prediction using SQL syntax. A financial services company could train a fraud detection model on historical transaction data in batch mode, then apply that model to streaming transactions as they arrive.
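
Training and prediction in BigQuery ML are expressed as SQL statements, which can be run from the same Python client; the dataset, columns, and model options below are illustrative assumptions.

from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression fraud model on historical transactions (hypothetical schema)
training_sql = """
    CREATE OR REPLACE MODEL `my-project.fraud.txn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['is_fraud']) AS
    SELECT amount, merchant_category, country, is_fraud
    FROM `my-project.fraud.historical_transactions`
"""
client.query(training_sql).result()  # waits for training to complete

# Score newly arrived transactions with ML.PREDICT
scoring_sql = """
    SELECT *
    FROM ML.PREDICT(MODEL `my-project.fraud.txn_model`,
                    TABLE `my-project.fraud.recent_transactions`)
"""
for row in client.query(scoring_sql).result():
    print(dict(row))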

Choosing the Right Service for Your Workload

The decision framework starts with understanding your latency requirements. If insights must be available within seconds of data generation, you need streaming capabilities. A ride-sharing platform matching drivers to passengers can't wait for batch processing windows. This scenario calls for Pub/Sub to ingest ride requests, Dataflow to process matching logic, and potentially BigQuery for analytics on completed rides.

When latency requirements are measured in hours or days, batch processing typically costs less and proves simpler to implement. A climate modeling research project processing satellite imagery from daily collection runs can use Cloud Storage for data landing, Dataproc for analysis using existing Spark code, and Cloud Composer to orchestrate the workflow.

Many organizations need both models for different parts of their data platform. An esports platform might stream player match data through Pub/Sub to Dataflow for real-time leaderboard updates, while also running nightly batch jobs in Dataproc to calculate complex player skill ratings that require analyzing the full dataset.

Consider existing technical investments and team skills. Organizations with significant Hadoop or Spark code benefit from Dataproc because it runs existing jobs with minimal changes. Teams comfortable with SQL might prefer BigQuery for both batch and streaming analytics. Those building new pipelines without legacy constraints often choose Dataflow for its unified programming model and managed scaling.

Integration Patterns Across GCP Services

These services rarely work in isolation. Common patterns combine multiple GCP components into complete data platforms. A typical streaming architecture flows data from Pub/Sub (ingestion) to Dataflow (transformation) to BigQuery (analysis) with results visualized in Looker or Data Studio.

A podcast network might implement this pattern by publishing listener analytics events to Pub/Sub as episodes play. Dataflow enriches these events with user profile data from Cloud Firestore, aggregates listening patterns, and streams results into BigQuery. Data analysts query BigQuery to understand episode performance, audience retention, and content preferences.

Batch pipelines often use Cloud Storage as a central data lake. A transit authority might store raw ridership data from turnstiles and buses in Cloud Storage. Cloud Composer orchestrates a nightly workflow that triggers Dataproc jobs to process this data, loads results into BigQuery tables, and generates reports. The same BigQuery tables support both scheduled batch queries for planning purposes and ad hoc analysis by transportation analysts.

The choice between these services also affects how you handle late-arriving data, exactly-once processing guarantees, and state management. Dataflow provides sophisticated windowing and watermarking for handling late data in streaming pipelines. BigQuery handles deduplication through merge statements. Pub/Sub guarantees at-least-once delivery but leaves deduplication to consumers.

Cost and Performance Considerations

Batch processing generally costs less per record because it amortizes overhead across large datasets and allows flexible scheduling during low-demand periods. A university system processing enrollment data might run intensive batch jobs overnight when computing resources are cheaper and don't compete with student-facing systems.

Streaming processing costs more per record but delivers value through immediacy. A credit card payment processor can't batch fraud detection because approving fraudulent transactions causes direct financial losses. The incremental cost of stream processing is justified by prevented fraud losses.

Dataflow pricing includes worker vCPUs, memory, and persistent disk used during pipeline execution. Because the service autoscales, costs correlate directly with data volume and processing complexity. You can reduce costs by optimizing pipeline code, using appropriate machine types, and enabling features like Flexible Resource Scheduling that allows GCP to delay batch jobs for discounted pricing.

Dataproc charges for the Compute Engine instances in your cluster plus a small Dataproc management fee. The ephemeral cluster pattern minimizes costs by only paying for compute during active processing. Preemptible VMs can reduce costs by up to 80% for fault-tolerant batch workloads.

BigQuery storage is inexpensive compared to query costs. Active storage costs around $0.02 per GB monthly, while long-term storage (tables unchanged for 90 days) drops to $0.01 per GB. Query costs depend on data scanned, making partitioning and clustering critical for cost control on large tables.

Key Takeaways for Data Engineers

Understanding which Google Cloud services fit batch processing, streaming data, or both models guides architectural decisions that affect performance, cost, and maintenance complexity. Pub/Sub provides the foundation for streaming ingestion. Dataproc and Composer specialize in batch workloads with support for existing Hadoop/Spark code and complex workflow orchestration. Dataflow and BigQuery bridge both worlds, offering flexibility for pipelines that need to handle historical analysis and real-time processing.

The Professional Data Engineer exam tests your ability to select appropriate services based on requirements like latency, data volume, existing code investments, and cost constraints. Recognizing that some workloads clearly fit one processing model while others benefit from hybrid approaches demonstrates the architectural judgment expected of data engineering professionals.

For those seeking comprehensive preparation covering these services and their practical application in exam scenarios, the Professional Data Engineer course provides detailed coverage of batch and streaming architectures on Google Cloud Platform.