Building Zero-Downtime Kafka Migrations at Scale

January 20, 2026

12 min read


Why Zero-Downtime Migration Is Non-Negotiable

In 2025, my team at Mr. Cooper executed one of the most complex infrastructure migrations we'd undertaken: moving our entire Confluent Kafka cluster to Private Service Connect (PSC). The cluster handled over 500,000 events per day, serving the mortgage servicing platform for 10M+ customers. Downtime was not an option.

This article documents the strategy, execution, and lessons learned — a framework you can apply to any major Kafka infrastructure migration.

What Is PSC and Why Did We Migrate?

Private Service Connect (PSC) is a Google Cloud networking feature that allows private, encrypted connectivity between consumer and producer VPCs without traversing the public internet. Before PSC, our Confluent Cloud cluster used VPC peering, which introduced several pain points:

  • Network traffic between services was routable across peered VPCs, creating a larger blast radius for security incidents
  • IP address range conflicts as our GCP footprint expanded
  • Inconsistent latency during peak hours due to shared peering bandwidth
  • Compliance concerns around data traversal paths

PSC solved all of these: traffic stays entirely within Google's network, each service gets a dedicated private endpoint, and the connection model is strictly one-directional (the consumer VPC initiates connections to the producer's service attachment; the producer side cannot reach back into the consumer network).

Migration Architecture: The Dual-Cluster Strategy

The core principle of zero-downtime Kafka migration is simple: run two clusters in parallel and migrate consumer groups incrementally. We called this our Blue (legacy) / Green (PSC) strategy.


Phase 1: Green Cluster Provisioning (Week 1–2)

We provisioned the new PSC-enabled Confluent cluster in parallel with the existing peering-based cluster:

  • Created identical topic configurations (partition counts, replication factors, retention policies)
  • Set up Schema Registry replication to keep schemas in sync
  • Configured identical ACLs and service accounts
  • Established monitoring dashboards for both clusters in New Relic
  • Validated network connectivity from all 40+ consumer service pods
# Example: Verify PSC endpoint connectivity from consumer pod
kubectl exec -it <consumer-pod> -- nc -zv <psc-endpoint> 9092
# Expected: Connection to <psc-endpoint> 9092 port succeeded
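Beyond raw connectivity, the "identical topic configurations" item is worth checking programmatically rather than by eye. This is a sketch under the assumption that per-topic config maps have already been fetched from each cluster (for example via AdminClient's describeConfigs); the class and method names are illustrative:

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Sketch: compare topic configurations from both clusters and report any
// topic whose settings differ, or that exists on only one side.
public class TopicParityCheck {

    // blue/green map topic name -> config key/value pairs.
    public static Set<String> diff(Map<String, Map<String, String>> blue,
                                   Map<String, Map<String, String>> green) {
        Set<String> mismatches = new TreeSet<>();
        Set<String> allTopics = new TreeSet<>(blue.keySet());
        allTopics.addAll(green.keySet());
        for (String topic : allTopics) {
            // Objects.equals also flags topics missing from one cluster
            // (one side returns null for an absent topic).
            if (!Objects.equals(blue.get(topic), green.get(topic))) {
                mismatches.add(topic);
            }
        }
        return mismatches;
    }
}
```

Running this in CI against both clusters turns config drift into a failing check instead of a migration-day surprise.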

Phase 2: Producer Migration (Week 3–4)

We migrated producers first — before any consumers. This allowed the green cluster to accumulate message history before consumers switched over:

  1. Identify producer services — We catalogued every application writing to Kafka (32 services across 8 teams)
  2. Feature-flag the broker URL — Each producer read its bootstrap server address from an environment variable. We added the PSC endpoint as a feature-flagged override
  3. Shadow publishing — For critical topics (loan events, customer updates), we temporarily dual-published to both clusters using a custom wrapper that wrote to both brokers with identical keys
  4. Validate with lag monitoring — Confirmed consumer lag remained at zero on blue cluster before declaring producer migration complete
// Producer dual-write wrapper (simplified)
public void send(ProducerRecord<String, Object> record) {
    blueProducer.send(record);  // Legacy cluster stays the source of truth
    if (pscFeatureEnabled) {
        // Best-effort shadow write: a green outage must never block the
        // blue publish path, so failures are logged rather than thrown
        greenProducer.send(record, (metadata, ex) -> {
            if (ex != null) log.warn("Shadow publish to green failed", ex);
        });
    }
}
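The feature-flagged override in step 2 can be reduced to a single pure function, which keeps rollback to a flag flip rather than a config change across 32 services. A sketch; the variable names (KAFKA_BOOTSTRAP, PSC_ENABLED, PSC_BOOTSTRAP_OVERRIDE) are illustrative, not our actual configuration keys:

```java
import java.util.Map;

// Sketch of feature-flagged bootstrap server selection. The environment
// map is injected (normally System.getenv()) so the logic is testable.
public class BrokerSelector {

    public static String bootstrapServers(Map<String, String> env) {
        boolean pscEnabled =
            Boolean.parseBoolean(env.getOrDefault("PSC_ENABLED", "false"));
        if (pscEnabled && env.containsKey("PSC_BOOTSTRAP_OVERRIDE")) {
            return env.get("PSC_BOOTSTRAP_OVERRIDE");  // green (PSC) endpoint
        }
        return env.get("KAFKA_BOOTSTRAP");             // blue (legacy) endpoint
    }
}
```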

Phase 3: Consumer Group Migration (Week 5–6)

This was the most delicate phase. We migrated consumer groups team-by-team, with a 48-hour overlap window for each:

  1. Reset consumer group offsets on green cluster to match the current position on blue: kafka-consumer-groups --bootstrap-server <psc-endpoint>:9092 --group <group-id> --all-topics --reset-offsets --to-latest --dry-run, then re-run with --execute once the planned offsets look right
  2. Deploy consumer to green with bootstrap.servers pointing to PSC endpoint
  3. Monitor for 48 hours: consumer lag, error rates, processing latency
  4. Decommission blue consumer only after validation passes

Critical lesson: Never migrate producers and consumers for the same topic simultaneously. Always finish producer migration first, let offset history build on green, then migrate consumers. This gives you a replay buffer if anything goes wrong.

Handling the Hard Cases

Exactly-Once Semantics (EOS) Producers

Several of our critical producers used Kafka transactions for exactly-once semantics. These required special handling because transaction IDs are cluster-scoped:

  • Generate new unique transactional.id values for green cluster (avoid collision with blue)
  • Disable dual-write for EOS producers — the transactional guarantees can't span clusters
  • Use a hard cutover for EOS producers during a low-traffic window (Sunday 2–4 AM)
  • Maintain blue cluster availability for 72 hours post-cutover as rollback option
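The transactional.id remapping in the first bullet can be as simple as a deterministic suffix. A sketch (the "-psc" suffix is a hypothetical convention, not necessarily what you'd pick):

```java
// Sketch: derive collision-free transactional.id values for the green
// cluster from the existing blue ones.
public class TxnIdMapper {
    static final String GREEN_SUFFIX = "-psc";

    public static String greenTransactionalId(String blueTxnId) {
        // Idempotent: re-running config generation never double-suffixes.
        if (blueTxnId.endsWith(GREEN_SUFFIX)) return blueTxnId;
        // Namespacing by cluster avoids producer fencing surprises while
        // blue stays warm as the 72-hour rollback target.
        return blueTxnId + GREEN_SUFFIX;
    }
}
```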

Schema Evolution During Migration

We had 140+ schemas registered in Confluent Schema Registry. Rather than replicating schemas manually, we used Confluent's Schema Registry migration tooling:

# Export schemas from blue registry
confluent schema-registry cluster export --config blue-config.properties > schemas.json

# Import to green registry  
confluent schema-registry cluster import --config green-config.properties --file schemas.json
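Whichever tooling performs the copy, verify the result programmatically before declaring the registry migrated. A sketch of a post-import parity check, assuming subject-to-versions maps fetched from each registry's REST API (GET /subjects and GET /subjects/{subject}/versions):

```java
import java.util.List;
import java.util.Map;

// Sketch: confirm every (subject, version) pair registered on the blue
// registry also exists on green. Extra versions on green are fine.
public class SchemaParity {

    public static boolean allMigrated(Map<String, List<Integer>> blue,
                                      Map<String, List<Integer>> green) {
        return blue.entrySet().stream().allMatch(e ->
            green.getOrDefault(e.getKey(), List.of())
                 .containsAll(e.getValue()));
    }
}
```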

Consumer Groups With Long Retention Windows

Some consumer groups (analytics, audit) had retention requirements of 7–14 days and needed to replay historical events. We handled this by:

  • Keeping the blue cluster running for 14 additional days after consumer migration
  • Setting offsets on green to the equivalent position via timestamp reset
  • Running a one-time historical replay job from blue to green for gaps during the dual-write window
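For the timestamp reset in the second bullet, kafka-consumer-groups expects its --to-datetime argument in a specific ISO-8601 form (YYYY-MM-DDTHH:mm:SS.sss), and hand-formatting it per group is error-prone. A small helper, with an illustrative class name:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch: format a replay-start instant as the --to-datetime argument
// for kafka-consumer-groups --reset-offsets.
public class TimestampReset {
    private static final DateTimeFormatter CLI_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS")
                         .withZone(ZoneOffset.UTC);

    public static String toDatetimeArg(Instant position) {
        return CLI_FORMAT.format(position);
    }
}
```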

Monitoring and Rollback Strategy

We defined clear success/rollback criteria upfront and automated the rollback trigger:

Success Criteria (per consumer group):

  • Consumer lag < 100 messages for 4 consecutive hours
  • Error rate < 0.01% for 48 hours
  • P99 message processing latency within 10% of baseline
  • No dead letter queue accumulation
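These criteria are mechanical enough to encode as an automated gate rather than a judgment call. A minimal sketch with the metric plumbing omitted; thresholds mirror the list above, and the field names are illustrative:

```java
// Sketch: per-consumer-group success gate. Inputs are the worst observed
// values over the validation window (4h for lag, 48h for error rate).
public class MigrationGate {

    public static boolean passes(long maxConsumerLag,
                                 double errorRatePct,
                                 double p99LatencyMs,
                                 double baselineP99Ms,
                                 long dlqDepth) {
        return maxConsumerLag < 100                    // lag < 100 messages
            && errorRatePct < 0.01                     // error rate < 0.01%
            && p99LatencyMs <= baselineP99Ms * 1.10    // within 10% of baseline
            && dlqDepth == 0;                          // no DLQ accumulation
    }
}
```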

Automated Rollback Trigger:

# New Relic alert condition: rollback trigger
SELECT count(*) FROM KafkaConsumerMetrics 
WHERE consumer_lag > 10000 
AND cluster = 'green-psc'
SINCE 15 minutes ago

Results

The migration completed 2 months ahead of schedule with:

  • Zero downtime — not a single dropped message or consumer interruption
  • -30% end-to-end latency — PSC's dedicated private endpoints eliminated peering bottlenecks
  • Hardened security posture — eliminated public internet traversal and reduced the blast radius of a network compromise
  • 100% schema compatibility — all 140+ schemas migrated without version conflicts
  • 32 producer services and 40+ consumer service pods migrated without incident

Key Lessons

  1. Always migrate producers before consumers — this gives consumers a buffer of historical events on the new cluster
  2. Feature-flag everything — broker URLs, schema registry URLs, consumer group IDs should all be configuration-driven
  3. Define rollback criteria before you start — not during the migration when you're under pressure
  4. Shadow-publish for critical topics — the cost of dual-writing is worth the safety net
  5. Automate the offset reset — manual offset management at scale is error-prone
  6. Communicate early and often — all 8 teams knew the migration calendar 6 weeks in advance

Zero-downtime migrations aren't magic. They're the result of careful planning, incremental execution, and a clear rollback strategy at every step. The techniques described here apply equally to Kafka version upgrades, cloud-to-cloud migrations, or any major streaming infrastructure change.