Building Zero-Downtime Kafka Migrations at Scale
January 20, 2026
12 min read
Why Zero-Downtime Migration Is Non-Negotiable
In 2025, my team at Mr. Cooper executed one of the most complex infrastructure migrations we'd undertaken: moving our entire Confluent Kafka cluster to Private Service Connect (PSC). The cluster handled over 500,000 events per day, serving the mortgage servicing platform for 10M+ customers. Downtime was not an option.
This article documents the strategy, execution, and lessons learned — a framework you can apply to any major Kafka infrastructure migration.
What Is PSC and Why Did We Migrate?
Private Service Connect (PSC) is a Google Cloud networking feature that allows private, encrypted connectivity between consumer and producer VPCs without traversing the public internet. Before PSC, our Confluent Cloud cluster used VPC peering, which introduced several pain points:
- Network traffic between services was routable across peered VPCs, creating a larger blast radius for security incidents
- IP address range conflicts as our GCP footprint expanded
- Inconsistent latency during peak hours due to shared peering bandwidth
- Compliance concerns around data traversal paths
PSC solved all of these: traffic stays entirely within Google's network, each service gets a dedicated private endpoint, and the connection model is strictly one-directional (connections can only be initiated from the consumer side toward the producer's service attachment).
Migration Architecture: The Dual-Cluster Strategy
The core principle of zero-downtime Kafka migration is simple: run two clusters in parallel and migrate consumer groups incrementally. We called this our Blue (legacy) / Green (PSC) strategy.
Phase 1: Green Cluster Provisioning (Week 1–2)
We provisioned the new PSC-enabled Confluent cluster in parallel with the existing peering-based cluster:
- Created identical topic configurations (partition counts, replication factors, retention policies) on the green cluster; a provisioning sketch follows below
- Set up Schema Registry replication to keep schemas in sync
- Configured identical ACLs and service accounts
- Established monitoring dashboards for both clusters in New Relic
- Validated network connectivity from all 40+ consumer service pods
# Example: Verify PSC endpoint connectivity from consumer pod
kubectl exec -it <consumer-pod> -- nc -zv <psc-endpoint> 9092
# Expected: Connection to <psc-endpoint> 9092 port succeeded
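Recreating dozens of topic configurations by hand is error-prone, so this step is worth scripting. Below is a simplified sketch of the idea using the Kafka AdminClient (not our exact tooling; the topic name, partition count, replication factor, and retention value are placeholders):

// Sketch: mirror a topic's configuration onto the green (PSC) cluster
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class GreenTopicProvisioner {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<psc-endpoint>:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Partition count, replication factor, and retention mirror the blue topic (values illustrative)
            NewTopic loanEvents = new NewTopic("loan-events", 12, (short) 3)
                    .configs(Map.of("retention.ms", "604800000",   // 7 days
                                    "cleanup.policy", "delete"));
            admin.createTopics(List.of(loanEvents)).all().get();
        }
    }
}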
Phase 2: Producer Migration (Week 3–4)
We migrated producers first — before any consumers. This allowed the green cluster to accumulate message history before consumers switched over:
- Identify producer services — We catalogued every application writing to Kafka (32 services across 8 teams)
- Feature-flag the broker URL — Each producer used an environment variable for the bootstrap server address. We added PSC endpoint as a feature-flagged override
- Shadow publishing — For critical topics (loan events, customer updates), we temporarily dual-published to both clusters using a custom wrapper that wrote to both brokers with identical keys
- Validate with lag monitoring — Confirmed consumer lag remained at zero on blue cluster before declaring producer migration complete
// Producer dual-write wrapper (simplified)
public void send(ProducerRecord<String, Object> record) {
    blueProducer.send(record);          // always write to the legacy (blue) cluster
    if (pscFeatureEnabled) {
        greenProducer.send(record);     // shadow-publish to the PSC (green) cluster
    }
}
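The wrapper assumes the two producers and the feature flag are injected from configuration. Here is a simplified sketch of that wiring; the environment variable names, serializers, and schema registry setting are illustrative assumptions, not our exact configuration:

// Sketch: build blue and green producers from environment-driven config
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public final class DualClusterProducers {
    // Environment variable names are hypothetical; the point is that broker URLs are configuration-driven
    static final String BLUE_BOOTSTRAP  = System.getenv("KAFKA_BLUE_BOOTSTRAP");
    static final String GREEN_BOOTSTRAP = System.getenv("KAFKA_GREEN_PSC_BOOTSTRAP");
    static final boolean PSC_DUAL_WRITE = Boolean.parseBoolean(System.getenv("PSC_DUAL_WRITE_ENABLED"));

    static KafkaProducer<String, Object> build(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", System.getenv("SCHEMA_REGISTRY_URL")); // assumed env var
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // keep durability settings identical on both clusters
        return new KafkaProducer<>(props);
    }

    static final KafkaProducer<String, Object> blueProducer  = build(BLUE_BOOTSTRAP);
    static final KafkaProducer<String, Object> greenProducer = build(GREEN_BOOTSTRAP);
}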
Phase 3: Consumer Group Migration (Week 5–6)
This was the most delicate phase. We migrated consumer groups team-by-team, with a 48-hour overlap window for each:
- Reset consumer group offsets on the green cluster to match the current position on blue: kafka-consumer-groups --reset-offsets --to-latest
- Deploy the consumer to green with bootstrap.servers pointing to the PSC endpoint
- Monitor for 48 hours: consumer lag, error rates, processing latency (a lag-check sketch follows this list)
- Decommission the blue consumer only after validation passes
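The consumer-lag part of that 48-hour watch can also be checked programmatically, with the New Relic dashboards remaining the source of truth. A simplified sketch of the calculation using the Kafka AdminClient (the group ID and endpoint are placeholders):

// Sketch: compute total lag for a migrated consumer group on the green cluster
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class GreenLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<psc-endpoint>:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the migrated group (group name is illustrative)
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("loan-events-consumer")
                         .partitionsToOffsetAndMetadata().get();

            // Log-end offsets for the same partitions
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResultInfo> latest = admin.listOffsets(latestSpec).all().get();

            long totalLag = committed.entrySet().stream()
                    .mapToLong(e -> latest.get(e.getKey()).offset() - e.getValue().offset())
                    .sum();
            System.out.println("Total lag for green consumer group: " + totalLag);
        }
    }
}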
Critical lesson: Never migrate producers and consumers for the same topic simultaneously. Always finish producer migration first, let offset history build on green, then migrate consumers. This gives you a replay buffer if anything goes wrong.
Handling the Hard Cases
Exactly-Once Semantics (EOS) Producers
Several of our critical producers used Kafka transactions for exactly-once semantics. These required special handling because transaction IDs are cluster-scoped:
- Generate new, unique transactional.id values for the green cluster to avoid collisions with blue (a producer sketch follows this list)
- Disable dual-write for EOS producers — the transactional guarantees can't span clusters
- Use a hard cutover for EOS producers during a low-traffic window (Sunday 2–4 AM)
- Maintain blue cluster availability for 72 hours post-cutover as a rollback option
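For illustration, a minimal sketch of a green-only transactional producer is below; the transactional.id scheme, topic, and payload are placeholders rather than our production values:

// Sketch: EOS producer pointed at the green cluster with a brand-new transactional.id
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GreenEosProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "<psc-endpoint>:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // New, green-only ID; never reuse the blue cluster's transactional.id values
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "loan-events-writer-green-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("loan-events", "loan-123", "{\"status\":\"UPDATED\"}"));
            producer.commitTransaction();
        }
    }
}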
Schema Evolution During Migration
We had 140+ schemas registered in Confluent Schema Registry. Rather than replicating schemas manually, we used Confluent's Schema Registry migration tooling:
# Export schemas from blue registry
confluent schema-registry cluster export --config blue-config.properties > schemas.json
# Import to green registry
confluent schema-registry cluster import --config green-config.properties --file schemas.json
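The same mirroring can also be scripted against the Schema Registry client if the CLI isn't an option. The sketch below assumes every subject is Avro and that the green registry starts empty; the registry URLs and cache size are placeholders:

// Sketch: copy every subject and version from the blue registry to the green registry
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;

public class SchemaMirror {
    public static void main(String[] args) throws Exception {
        CachedSchemaRegistryClient blue  = new CachedSchemaRegistryClient("https://blue-sr.example.com", 200);
        CachedSchemaRegistryClient green = new CachedSchemaRegistryClient("https://green-sr.example.com", 200);

        for (String subject : blue.getAllSubjects()) {
            // Register versions in order so the green registry preserves the same history
            for (Integer version : blue.getAllVersions(subject)) {
                SchemaMetadata meta = blue.getSchemaMetadata(subject, version);
                green.register(subject, new AvroSchema(meta.getSchema()));
            }
        }
    }
}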
Consumer Groups With Long Retention Windows
Some consumer groups (analytics, audit) had retention requirements of 7–14 days and needed to replay historical events. We handled this by:
- Keeping the blue cluster running for 14 additional days after consumer migration
- Setting offsets on green to the equivalent position via timestamp reset (a sketch follows this list)
- Running a one-time historical replay job from blue to green to backfill any gaps from the dual-write window
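The timestamp reset is another good candidate for automation. A simplified sketch with the Kafka AdminClient: resolve the offset at a chosen timestamp for each partition on green, then commit those offsets for the (inactive) group. The group name, endpoint, and 7-day window are placeholders:

// Sketch: reset a consumer group's offsets on green to the position at a given timestamp
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class TimestampOffsetReset {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<psc-endpoint>:9092"); // placeholder
        long resetMillis = Instant.now().minus(Duration.ofDays(7)).toEpochMilli();    // illustrative window

        try (AdminClient admin = AdminClient.create(props)) {
            // Partitions currently tracked by the group (group name is illustrative)
            Map<TopicPartition, OffsetAndMetadata> current =
                    admin.listConsumerGroupOffsets("analytics-consumer")
                         .partitionsToOffsetAndMetadata().get();

            // First offset at or after the reset timestamp for each partition
            Map<TopicPartition, OffsetSpec> byTime = current.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.forTimestamp(resetMillis)));
            Map<TopicPartition, ListOffsetsResultInfo> resolved = admin.listOffsets(byTime).all().get();

            // Commit the resolved offsets back to the group; the group must not be running.
            // In real tooling, skip partitions where offset() == -1 (no record at/after the timestamp).
            Map<TopicPartition, OffsetAndMetadata> newOffsets = resolved.entrySet().stream()
                    .collect(Collectors.toMap(Map.Entry::getKey,
                            e -> new OffsetAndMetadata(e.getValue().offset())));
            admin.alterConsumerGroupOffsets("analytics-consumer", newOffsets).all().get();
        }
    }
}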
Monitoring and Rollback Strategy
We defined clear success/rollback criteria upfront and automated the rollback trigger:
Success Criteria (per consumer group):
- Consumer lag < 100 messages for 4 consecutive hours
- Error rate < 0.01% for 48 hours
- P99 message processing latency within 10% of baseline
- No dead letter queue accumulation
Automated Rollback Trigger:
# New Relic alert condition: rollback trigger
SELECT count(*) FROM KafkaConsumerMetrics
WHERE consumer_lag > 10000
AND cluster = 'green-psc'
SINCE 15 minutes ago
Results
The migration completed 2 months ahead of schedule with:
- Zero downtime — not a single dropped message or consumer interruption
- -30% end-to-end latency — PSC's dedicated private endpoints eliminated peering bottlenecks
- Stronger security posture — eliminated broad cross-VPC routability and reduced the blast radius
- 100% schema compatibility — all 140+ schemas migrated without version conflicts
- 32 producer services and 40+ consumer service pods migrated without incident
Key Lessons
- Always migrate producers before consumers — this gives consumers a buffer of historical events on the new cluster
- Feature-flag everything — broker URLs, schema registry URLs, consumer group IDs should all be configuration-driven
- Define rollback criteria before you start — not during the migration when you're under pressure
- Shadow-publish for critical topics — the cost of dual-writing is worth the safety net
- Automate the offset reset — manual offset management at scale is error-prone
- Communicate early and often — all 8 teams knew the migration calendar 6 weeks in advance
Zero-downtime migrations aren't magic. They're the result of careful planning, incremental execution, and a clear rollback strategy at every step. The techniques described here apply equally to Kafka version upgrades, cloud-to-cloud migrations, or any major streaming infrastructure change.