How to Migrate Kafka to Pub/Sub: A Gradual Strategy
A hands-on guide for data engineers on implementing a gradual migration from on-premises Apache Kafka to Google Cloud Pub/Sub using hybrid architectures and Kafka connectors.
Learning how to migrate Kafka to Pub/Sub is essential knowledge for anyone preparing for the Professional Data Engineer certification. You'll commonly encounter exam scenarios involving on-premises Apache Kafka systems that need to move to a managed, autoscaling solution in the cloud. When you see questions about migrating from on-premises Kafka, the answer will typically point to Google Cloud Pub/Sub as the target messaging system.
This tutorial walks you through implementing a gradual migration strategy that allows you to move from on-premises Kafka to Cloud Pub/Sub without disrupting existing operations. Instead of attempting a risky big-bang migration, you'll build a hybrid architecture where both systems work together during the transition period. By the end of this guide, you'll have a working implementation that bridges Kafka and Pub/Sub while progressively moving workloads to the cloud.
Why This Migration Matters
Organizations migrate from on-premises Kafka to Cloud Pub/Sub for several compelling reasons. Pub/Sub provides autoscaling capabilities that eliminate capacity planning concerns, offers global reach for distributed applications, and reduces operational overhead by removing the need to manage Kafka clusters. The migration becomes particularly valuable when you need to integrate with other Google Cloud services like BigQuery, Dataflow, or Cloud Functions.
A gradual migration strategy reduces risk by allowing you to validate the cloud setup with non-critical workloads first, maintain business continuity throughout the transition, and roll back quickly if issues arise. This approach also gives your team time to develop expertise with GCP messaging patterns before fully committing to the new platform.
Prerequisites and Requirements
Before starting this tutorial, ensure you have a Google Cloud project with billing enabled, the gcloud CLI installed and configured, and Editor or Owner role on your GCP project. You'll need an existing Apache Kafka cluster (on-premises or cloud-hosted), access to configure Kafka Connect on your Kafka cluster, and basic understanding of messaging systems and publish-subscribe patterns. Set aside approximately 2-3 hours to complete the full setup.
You'll also need the Pub/Sub API enabled in your Google Cloud project. If you haven't enabled it yet, use this command:
gcloud services enable pubsub.googleapis.com --project=YOUR_PROJECT_ID
Overview of the Migration Architecture
The gradual migration strategy involves running Kafka and Pub/Sub in parallel during a transition period. You'll use the Google Cloud Pub/Sub Kafka connectors to establish bidirectional communication between the two systems. This hybrid setup allows legacy workloads to continue running on Kafka while new cloud-native workloads get built on Pub/Sub.
The key components include your existing Kafka cluster, Kafka Connect workers running the Pub/Sub connectors, Cloud Pub/Sub topics and subscriptions, and your application code that will progressively shift from Kafka consumers to Pub/Sub subscribers. The connectors handle message translation and transport between the two platforms, ensuring data flows reliably during the migration.
Step 1: Create Cloud Pub/Sub Topics
Start by creating the Pub/Sub topics that will receive messages from your Kafka topics. For this tutorial, we'll migrate a logistics company's shipment tracking system that currently uses Kafka to process real-time location updates from delivery vehicles.
Create a topic for shipment events:
gcloud pubsub topics create shipment-events \
--project=YOUR_PROJECT_ID \
--message-retention-duration=7d
Create a corresponding subscription that your cloud applications will use:
gcloud pubsub subscriptions create shipment-events-sub \
--topic=shipment-events \
--ack-deadline=60 \
--message-retention-duration=7d \
--expiration-period=never
The message retention duration ensures you can replay messages if needed during the migration. The ack deadline gives your applications enough time to process each message before it becomes available for redelivery.
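If you do need to replay retained messages later in the migration, you can seek the subscription back to an earlier timestamp. Below is a minimal sketch using the Python client library (google-cloud-pubsub); note that replaying messages that were already acknowledged also requires creating the subscription with the --retain-acked-messages flag, which the command above does not set.
import datetime

from google.cloud import pubsub_v1
from google.protobuf import timestamp_pb2

project_id = "YOUR_PROJECT_ID"
subscription_id = "shipment-events-sub"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# Rewind the subscription one hour: retained messages published after this
# point become outstanding again and are redelivered to subscribers.
seek_time = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=1)
timestamp = timestamp_pb2.Timestamp()
timestamp.FromDatetime(seek_time)

subscriber.seek(request={"subscription": subscription_path, "time": timestamp})
print(f"Replaying messages on {subscription_path} from {seek_time.isoformat()}")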
Verify Topic Creation
Confirm your topic was created successfully:
gcloud pubsub topics describe shipment-events --project=YOUR_PROJECT_ID
You should see output showing the topic name, message storage policy, and retention settings. If you encounter permission errors, verify that the Pub/Sub API is enabled and your account has the necessary IAM roles.
Step 2: Create a Service Account for Kafka Connectors
The Kafka connectors need credentials to authenticate with Google Cloud. Create a dedicated service account with minimal permissions following the principle of least privilege.
Create the service account:
gcloud iam service-accounts create kafka-connector \
--display-name="Kafka to Pub/Sub Connector" \
--project=YOUR_PROJECT_ID
Grant the service account permission to publish to Pub/Sub topics:
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
--member="serviceAccount:kafka-connector@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/pubsub.publisher"
If you plan to use source connectors that read from Pub/Sub into Kafka, also grant subscriber permissions:
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
--member="serviceAccount:kafka-connector@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/pubsub.subscriber"
Download the service account key file that the connector will use:
gcloud iam service-accounts keys create ~/kafka-connector-key.json \
--iam-account=kafka-connector@YOUR_PROJECT_ID.iam.gserviceaccount.com
Store this key file securely and note its path. You'll reference it in the connector configuration. Never commit this file to version control or expose it publicly.
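Before handing the key to Kafka Connect, it can be worth a quick check that it actually authorizes publishing. Here is a small sketch that loads the key with the Python client library and publishes a throwaway message to the topic created in Step 1; the key path and project ID are the same placeholders used above.
from google.cloud import pubsub_v1
from google.oauth2 import service_account

# Assumed placeholders: adjust the key path and project ID to your environment.
key_path = "/path/to/kafka-connector-key.json"
project_id = "YOUR_PROJECT_ID"

credentials = service_account.Credentials.from_service_account_file(key_path)
publisher = pubsub_v1.PublisherClient(credentials=credentials)
topic_path = publisher.topic_path(project_id, "shipment-events")

# result() blocks until the publish succeeds and raises if the service
# account lacks roles/pubsub.publisher or the key file is invalid.
future = publisher.publish(topic_path, b'{"check": "kafka-connector-credentials"}')
print("Published test message with ID:", future.result())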
Step 3: Install the Pub/Sub Kafka Connector
The Google Cloud Pub/Sub Kafka connector enables integration between Kafka and Pub/Sub. Download the connector JAR files and place them in your Kafka Connect plugins directory.
Download the connector from the Confluent Hub or Google Cloud's GitHub repository:
wget https://github.com/googleapis/java-pubsub-group-kafka-connector/releases/download/v1.0.0/pubsub-group-kafka-connector-1.0.0.jar
Copy the connector to your Kafka Connect plugins directory. The exact location depends on your Kafka installation, but it's typically something like this:
mkdir -p /opt/kafka/plugins/pubsub-connector
cp pubsub-group-kafka-connector-1.0.0.jar /opt/kafka/plugins/pubsub-connector/
Update your Kafka Connect worker configuration to include the plugins directory. Edit your connect-distributed.properties or connect-standalone.properties file:
plugin.path=/opt/kafka/plugins
Restart your Kafka Connect workers to load the new connector. After restarting, verify the connector is available:
curl -s localhost:8083/connector-plugins | grep PubSub
You should see output listing the Pub/Sub sink and source connector classes. If the connector doesn't appear, check your Kafka Connect logs for classpath or dependency errors.
Step 4: Configure the Sink Connector
The sink connector reads messages from Kafka topics and publishes them to Cloud Pub/Sub. This is the primary component for migrating data out of your on-premises Kafka cluster into Google Cloud.
Create a configuration file for the sink connector. Save this as pubsub-sink-connector.json:
{
  "name": "pubsub-sink-shipment-events",
  "config": {
    "connector.class": "com.google.pubsub.kafka.sink.CloudPubSubSinkConnector",
    "tasks.max": "3",
    "topics": "shipment-events",
    "cps.project": "YOUR_PROJECT_ID",
    "cps.topic": "shipment-events",
    "gcp.credentials.file.path": "/path/to/kafka-connector-key.json",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false"
  }
}
The configuration maps your Kafka topic named shipment-events to the Pub/Sub topic with the same name. The tasks.max setting controls parallelism, allowing three concurrent tasks to process messages. Adjust this based on your throughput requirements and available resources.
Deploy the connector using the Kafka Connect REST API:
curl -X POST -H "Content-Type: application/json" \
--data @pubsub-sink-connector.json \
http://localhost:8083/connectors
Check the connector status to ensure it started successfully:
curl -s localhost:8083/connectors/pubsub-sink-shipment-events/status | jq
A healthy connector shows a state of RUNNING for both the connector and all tasks. If you see FAILED status, check the Kafka Connect logs for authentication errors or network connectivity issues.
Step 5: Verify Message Flow
Test the sink connector by producing a message to your Kafka topic and confirming it arrives in Pub/Sub. This validation step ensures the integration works before you start migrating production traffic.
Produce a test message to Kafka using the console producer:
echo '{"shipment_id": "SH12345", "location": "Denver", "status": "in_transit", "timestamp": "2024-01-15T14:30:00Z"}' | \
kafka-console-producer --broker-list localhost:9092 --topic shipment-events
Pull the message from Pub/Sub to confirm it arrived:
gcloud pubsub subscriptions pull shipment-events-sub \
--auto-ack \
--limit=1 \
--project=YOUR_PROJECT_ID
You should see your test message in the output. The data field contains the message payload, and you'll also see message attributes and the publish timestamp. If no message appears, check that your Kafka Connect worker is running and the connector status shows RUNNING.
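If you would rather script this verification than run gcloud by hand, a synchronous pull with the Python client library covers the same check. A sketch assuming the project and subscription names used throughout this tutorial:
from google.cloud import pubsub_v1

project_id = "YOUR_PROJECT_ID"
subscription_id = "shipment-events-sub"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# Pull up to five messages synchronously, print them, then acknowledge.
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 5})

ack_ids = []
for received in response.received_messages:
    print("Received:", received.message.data.decode("utf-8"))
    ack_ids.append(received.ack_id)

if ack_ids:
    subscriber.acknowledge(request={"subscription": subscription_path, "ack_ids": ack_ids})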
Step 6: Migrate Consumer Applications
With messages flowing from Kafka to Pub/Sub, you can now start migrating consumer applications. This step involves creating new cloud-native services that read from Pub/Sub instead of Kafka.
Here's a Python example of a Pub/Sub subscriber that processes shipment events:
import json

from google.cloud import pubsub_v1

project_id = "YOUR_PROJECT_ID"
subscription_id = "shipment-events-sub"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    # Decode the JSON payload delivered by the Kafka sink connector.
    data = json.loads(message.data.decode('utf-8'))
    print(f"Processing shipment {data['shipment_id']} at {data['location']}")
    # Add your business logic here
    # For example: update database, trigger alerts, etc.
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
print(f"Listening for messages on {subscription_path}...")

try:
    streaming_pull_future.result()
except KeyboardInterrupt:
    # Stop the background streaming pull and wait for it to shut down cleanly.
    streaming_pull_future.cancel()
    streaming_pull_future.result()
Deploy this consumer alongside your existing Kafka consumers. Both can process the same events during the migration period. Monitor both systems to ensure message processing remains consistent.
Step 7: Implement the Source Connector (Optional)
If you need bidirectional sync or want to gradually migrate producers, configure a source connector that reads from Pub/Sub and writes to Kafka. This allows cloud applications to publish to Pub/Sub while legacy Kafka consumers continue operating.
Create a configuration file named pubsub-source-connector.json:
{
  "name": "pubsub-source-shipment-events",
  "config": {
    "connector.class": "com.google.pubsub.kafka.source.CloudPubSubSourceConnector",
    "tasks.max": "2",
    "kafka.topic": "cloud-shipment-events",
    "cps.project": "YOUR_PROJECT_ID",
    "cps.subscription": "cloud-events-for-kafka",
    "gcp.credentials.file.path": "/path/to/kafka-connector-key.json",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter"
  }
}
First create the Pub/Sub subscription that the source connector will read from:
gcloud pubsub subscriptions create cloud-events-for-kafka \
--topic=shipment-events \
--ack-deadline=60
Deploy the source connector:
curl -X POST -H "Content-Type: application/json" \
--data @pubsub-source-connector.json \
http://localhost:8083/connectors
This setup creates a bridge where messages published to Pub/Sub also appear in Kafka, allowing you to migrate producers independently from consumers.
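On the producer side of that bridge, a cloud application only publishes to Pub/Sub and never needs to know that Kafka exists; the source connector handles delivery into the cloud-shipment-events Kafka topic. A minimal publisher sketch, reusing the topic name from this tutorial:
import json

from google.cloud import pubsub_v1

project_id = "YOUR_PROJECT_ID"
topic_id = "shipment-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

event = {
    "shipment_id": "SH67890",
    "location": "Boulder",
    "status": "delivered",
    "timestamp": "2024-01-15T16:45:00Z",
}

# Subscribers on shipment-events-sub receive this directly, and the
# cloud-events-for-kafka subscription feeds it to the source connector,
# which writes it to the cloud-shipment-events topic in Kafka.
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
print("Published message ID:", future.result())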
Monitoring and Troubleshooting
Monitor your hybrid system using both Kafka metrics and Google Cloud's monitoring tools. Key metrics to watch include message throughput, connector lag, error rates, and consumer processing times.
Check Kafka Connect connector status regularly:
curl -s localhost:8083/connectors/pubsub-sink-shipment-events/status
Monitor Pub/Sub metrics in the Google Cloud Console. Navigate to Pub/Sub and select your topic to view publish rates, subscription backlog, and delivery latency. Set up alerts for unusual backlog growth or delivery failures.
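If you prefer to pull backlog numbers from a script rather than the console, the same data is exposed through the Cloud Monitoring API as the pubsub.googleapis.com/subscription/num_undelivered_messages metric. A sketch assuming the google-cloud-monitoring client library is installed:
import time

from google.cloud import monitoring_v3

project_id = "YOUR_PROJECT_ID"
subscription_id = "shipment-events-sub"

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
)

# Query the undelivered message count for the last 10 minutes.
results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages" '
            f'AND resource.labels.subscription_id = "{subscription_id}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print("Backlog:", point.value.int64_value)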
Common issues include authentication failures (check your service account key path and permissions), network connectivity problems (verify firewall rules allow outbound HTTPS to Google Cloud), and message format errors (ensure your converters match the data format). Check Kafka Connect logs at /var/log/kafka/connect.log for detailed error messages.
Completing the Migration
Once you've migrated all consumers and producers to Pub/Sub and verified stability over several days or weeks, you can decommission the Kafka infrastructure. The timeline depends on your organization's risk tolerance and the criticality of the migrated workloads.
Start by stopping the sink connector to prevent new messages from flowing to Pub/Sub:
curl -X PUT localhost:8083/connectors/pubsub-sink-shipment-events/pause
Verify that all applications are reading from and writing to Pub/Sub. Check application logs and metrics to confirm no traffic remains on Kafka topics. Once confirmed, delete the connectors:
curl -X DELETE localhost:8083/connectors/pubsub-sink-shipment-events
curl -X DELETE localhost:8083/connectors/pubsub-source-shipment-events
Archive any important Kafka data before decommissioning the cluster. You can export historical messages to Cloud Storage for long-term retention if needed. Finally, shut down your Kafka brokers and Zookeeper ensemble.
Next Steps
After completing your migration, consider implementing additional Pub/Sub features that weren't available in Kafka. Dead letter topics handle messages that repeatedly fail processing. Message filtering lets subscribers receive only relevant messages without processing overhead. Snapshots allow you to capture subscription state for testing or rollback scenarios.
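As one example, the sketch below creates a subscription that combines a message filter with a dead letter policy using the Python client. The shipment-events-dead-letter topic and in-transit-only-sub subscription names are hypothetical, filters match on message attributes rather than the JSON payload, and the dead letter topic must already exist with the Pub/Sub service agent allowed to publish to it.
from google.cloud import pubsub_v1

project_id = "YOUR_PROJECT_ID"
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project_id, "shipment-events")
dead_letter_topic = publisher.topic_path(project_id, "shipment-events-dead-letter")
subscription_path = subscriber.subscription_path(project_id, "in-transit-only-sub")

# Deliver only messages whose "status" attribute is "in_transit"; after five
# failed delivery attempts a message is forwarded to the dead letter topic.
subscription = subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "filter": 'attributes.status = "in_transit"',
        "dead_letter_policy": {
            "dead_letter_topic": dead_letter_topic,
            "max_delivery_attempts": 5,
        },
    }
)
print("Created subscription:", subscription.name)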
Integrate Pub/Sub with other Google Cloud services to build event-driven architectures. Connect Pub/Sub to Dataflow for stream processing, trigger Cloud Functions for serverless event handling, or stream directly to BigQuery for real-time analytics. These integrations provide capabilities that required significant custom development in Kafka-based architectures.