Building Zero-Downtime Kafka Migrations at Scale
January 20, 2026
12 min read
Why Zero-Downtime Migration Is Non-Negotiable
In 2025, my team at Mr. Cooper executed one of the most complex infrastructure migrations we'd undertaken: moving our entire Confluent Kafka cluster to Private Service Connect (PSC). The cluster handled over 500,000 events per day, serving the mortgage servicing platform for 10M+ customers. Downtime was not an option.
This article documents the strategy, execution, and lessons learned — a framework you can apply to any major Kafka infrastructure migration.
What Is PSC and Why Did We Migrate?
Private Service Connect (PSC) is a Google Cloud networking feature that allows private, encrypted connectivity between consumer and producer VPCs without traversing the public internet. Before PSC, our Confluent Cloud cluster used VPC peering, which introduced several pain points:
- Network traffic between services was routable across peered VPCs, creating a larger blast radius for security incidents
- IP address range conflicts as our GCP footprint expanded
- Inconsistent latency during peak hours due to shared peering bandwidth
- Compliance concerns around data traversal paths
PSC solved all of these: traffic stays entirely within Google's network, each service gets a dedicated private endpoint, and the connection model is strictly one-directional (connections can only be initiated from the consumer side toward the producer's service attachment).
Migration Architecture: The Dual-Cluster Strategy
The core principle of zero-downtime Kafka migration is simple: run two clusters in parallel and migrate consumer groups incrementally. We called this our Blue (legacy) / Green (PSC) strategy.
Phase 1: Green Cluster Provisioning (Week 1–2)
We provisioned the new PSC-enabled Confluent cluster in parallel with the existing peering-based cluster:
- Created identical topic configurations (partition counts, replication factors, retention policies) on the green cluster; a provisioning sketch follows below
- Set up Schema Registry replication to keep schemas in sync
- Configured identical ACLs and service accounts
- Established monitoring dashboards for both clusters in New Relic
- Validated network connectivity from all 40+ consumer service pods
# Example: Verify PSC endpoint connectivity from consumer pod
kubectl exec -it <consumer-pod> -- nc -zv <psc-endpoint> 9092
# Expected: Connection to <psc-endpoint> 9092 port succeeded
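Recreating dozens of topic configurations by hand is error-prone, so this step is worth scripting. Below is a simplified sketch of the idea using the Kafka AdminClient (not our exact tooling; the topic name, partition count, replication factor, and retention value are placeholders):

// Sketch: mirror a topic's configuration onto the green (PSC) cluster
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class GreenTopicProvisioner {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<psc-endpoint>:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Partition count, replication factor, and retention mirror the blue topic (values illustrative)
            NewTopic loanEvents = new NewTopic("loan-events", 12, (short) 3)
                    .configs(Map.of("retention.ms", "604800000",   // 7 days
                                    "cleanup.policy", "delete"));
            admin.createTopics(List.of(loanEvents)).all().get();
        }
    }
}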
Phase 2: Producer Migration (Week 3–4)
We migrated producers first — before any consumers. This allowed the green cluster to accumulate message history before consumers switched over:
- Identify producer services — We catalogued every application writing to Kafka (32 services across 8 teams)
- Feature-flag the broker URL — Each producer used an environment variable for the bootstrap server address. We added PSC endpoint as a feature-flagged override
- Shadow publishing — For critical topics (loan events, customer updates), we temporarily dual-published to both clusters using a custom wrapper that wrote to both brokers with identical keys
- Validate with lag monitoring — Confirmed consumer lag remained at zero on blue cluster before declaring producer migration complete
// Producer dual-write wrapper (simplified)
public void send(ProducerRecord<String, Object> record) {
    blueProducer.send(record);          // always write to the legacy (blue) cluster
    if (pscFeatureEnabled) {
        greenProducer.send(record);     // shadow-publish to the PSC (green) cluster
    }
}
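The wrapper assumes the two producers and the feature flag are injected from configuration. Here is a simplified sketch of that wiring; the environment variable names, serializers, and schema registry setting are illustrative assumptions, not our exact configuration:

// Sketch: build blue and green producers from environment-driven config
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public final class DualClusterProducers {
    // Environment variable names are hypothetical; the point is that broker URLs are configuration-driven
    static final String BLUE_BOOTSTRAP  = System.getenv("KAFKA_BLUE_BOOTSTRAP");
    static final String GREEN_BOOTSTRAP = System.getenv("KAFKA_GREEN_PSC_BOOTSTRAP");
    static final boolean PSC_DUAL_WRITE = Boolean.parseBoolean(System.getenv("PSC_DUAL_WRITE_ENABLED"));

    static KafkaProducer<String, Object> build(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", System.getenv("SCHEMA_REGISTRY_URL")); // assumed env var
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // keep durability settings identical on both clusters
        return new KafkaProducer<>(props);
    }

    static final KafkaProducer<String, Object> blueProducer  = build(BLUE_BOOTSTRAP);
    static final KafkaProducer<String, Object> greenProducer = build(GREEN_BOOTSTRAP);
}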
Phase 3: Consumer Group Migration (Week 5–6)
This was the most delicate phase. We migrated consumer groups team-by-team, with a 48-hour overlap window for each:
- Reset consumer group offsets on the green cluster to match the current position on blue: kafka-consumer-groups --reset-offsets --to-latest
- Deploy the consumer to green with bootstrap.servers pointing to the PSC endpoint
- Monitor for 48 hours: consumer lag, error rates, processing latency (a lag-check sketch follows this list)
- Decommission the blue consumer only after validation passes
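The consumer-lag part of that 48-hour watch can also be checked programmatically, with the New Relic dashboards remaining the source of truth. A simplified sketch of the calculation using the Kafka AdminClient (the group ID and endpoint are placeholders):

// Sketch: compute total lag for a migrated consumer group on the green cluster
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class GreenLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<psc-endpoint>:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the migrated group (group name is illustrative)
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("loan-events-consumer")
                         .partitionsToOffsetAndMetadata().get();

            // Log-end offsets for the same partitions
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResultInfo> latest = admin.listOffsets(latestSpec).all().get();

            long totalLag = committed.entrySet().stream()
                    .mapToLong(e -> latest.get(e.getKey()).offset() - e.getValue().offset())
                    .sum();
            System.out.println("Total lag for green consumer group: " + totalLag);
        }
    }
}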
Critical lesson: Never migrate producers and consumers for the same topic simultaneously. Always finish producer migration first, let offset history build on green, then migrate consumers. This gives you a replay buffer if anything goes wrong.
Handling the Hard Cases
Exactly-Once Semantics (EOS) Producers
Several of our critical producers used Kafka transactions for exactly-once semantics. These required special handling because transaction IDs are cluster-scoped:
- Generate new, unique transactional.id values for the green cluster to avoid collisions with blue (a producer sketch follows this list)
- Disable dual-write for EOS producers — the transactional guarantees can't span clusters
- Use a hard cutover for EOS producers during a low-traffic window (Sunday 2–4 AM)
- Maintain blue cluster availability for 72 hours post-cutover as a rollback option
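For illustration, a minimal sketch of a green-only transactional producer is below; the transactional.id scheme, topic, and payload are placeholders rather than our production values:

// Sketch: EOS producer pointed at the green cluster with a brand-new transactional.id
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GreenEosProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "<psc-endpoint>:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // New, green-only ID; never reuse the blue cluster's transactional.id values
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "loan-events-writer-green-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("loan-events", "loan-123", "{\"status\":\"UPDATED\"}"));
            producer.commitTransaction();
        }
    }
}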
Schema Evolution During Migration
We had 140+ schemas registered in Confluent Schema Registry. Rather than replicating schemas manually, we used Confluent's Schema Registry migration tooling:
# Export schemas from blue registry
confluent schema-registry cluster export --config blue-config.properties > schemas.json
# Import to green registry
confluent schema-registry cluster import --config green-config.properties --file schemas.json
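The same mirroring can also be scripted against the Schema Registry client if the CLI isn't an option. The sketch below assumes every subject is Avro and that the green registry starts empty; the registry URLs and cache size are placeholders:

// Sketch: copy every subject and version from the blue registry to the green registry
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;

public class SchemaMirror {
    public static void main(String[] args) throws Exception {
        CachedSchemaRegistryClient blue  = new CachedSchemaRegistryClient("https://blue-sr.example.com", 200);
        CachedSchemaRegistryClient green = new CachedSchemaRegistryClient("https://green-sr.example.com", 200);

        for (String subject : blue.getAllSubjects()) {
            // Register versions in order so the green registry preserves the same history
            for (Integer version : blue.getAllVersions(subject)) {
                SchemaMetadata meta = blue.getSchemaMetadata(subject, version);
                green.register(subject, new AvroSchema(meta.getSchema()));
            }
        }
    }
}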
Consumer Groups With Long Retention Windows
Some consumer groups (analytics, audit) had retention requirements of 7–14 days and needed to replay historical events. We handled this by:
- Keeping the blue cluster running for 14 additional days after consumer migration
- Setting offsets on green to the equivalent position via timestamp reset (a sketch follows this list)
- Running a one-time historical replay job from blue to green to backfill any gaps from the dual-write window
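The timestamp reset is another good candidate for automation. A simplified sketch with the Kafka AdminClient: resolve the offset at a chosen timestamp for each partition on green, then commit those offsets for the (inactive) group. The group name, endpoint, and 7-day window are placeholders:

// Sketch: reset a consumer group's offsets on green to the position at a given timestamp
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class TimestampOffsetReset {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<psc-endpoint>:9092"); // placeholder
        long resetMillis = Instant.now().minus(Duration.ofDays(7)).toEpochMilli();    // illustrative window

        try (AdminClient admin = AdminClient.create(props)) {
            // Partitions currently tracked by the group (group name is illustrative)
            Map<TopicPartition, OffsetAndMetadata> current =
                    admin.listConsumerGroupOffsets("analytics-consumer")
                         .partitionsToOffsetAndMetadata().get();

            // First offset at or after the reset timestamp for each partition
            Map<TopicPartition, OffsetSpec> byTime = current.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.forTimestamp(resetMillis)));
            Map<TopicPartition, ListOffsetsResultInfo> resolved = admin.listOffsets(byTime).all().get();

            // Commit the resolved offsets back to the group; the group must not be running.
            // In real tooling, skip partitions where offset() == -1 (no record at/after the timestamp).
            Map<TopicPartition, OffsetAndMetadata> newOffsets = resolved.entrySet().stream()
                    .collect(Collectors.toMap(Map.Entry::getKey,
                            e -> new OffsetAndMetadata(e.getValue().offset())));
            admin.alterConsumerGroupOffsets("analytics-consumer", newOffsets).all().get();
        }
    }
}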
Monitoring and Rollback Strategy
We defined clear success/rollback criteria upfront and automated the rollback trigger:
Success Criteria (per consumer group):
- Consumer lag < 100 messages for 4 consecutive hours
- Error rate < 0.01% for 48 hours
- P99 message processing latency within 10% of baseline
- No dead letter queue accumulation
Automated Rollback Trigger:
# New Relic alert condition: rollback trigger
SELECT count(*) FROM KafkaConsumerMetrics
WHERE consumer_lag > 10000
AND cluster = 'green-psc'
SINCE 15 minutes ago
Results
The migration completed 2 months ahead of schedule with:
- Zero downtime — not a single dropped message or consumer interruption
- -30% end-to-end latency — PSC's dedicated private endpoints eliminated peering bottlenecks
- Stronger security posture — eliminated broad cross-VPC routability and reduced the blast radius
- 100% schema compatibility — all 140+ schemas migrated without version conflicts
- 32 producer services and 40+ consumer service pods migrated without incident
Key Lessons
- Always migrate producers before consumers — this gives consumers a buffer of historical events on the new cluster
- Feature-flag everything — broker URLs, schema registry URLs, consumer group IDs should all be configuration-driven
- Define rollback criteria before you start — not during the migration when you're under pressure
- Shadow-publish for critical topics — the cost of dual-writing is worth the safety net
- Automate the offset reset — manual offset management at scale is error-prone
- Communicate early and often — all 8 teams knew the migration calendar 6 weeks in advance
Zero-downtime migrations aren't magic. They're the result of careful planning, incremental execution, and a clear rollback strategy at every step. The techniques described here apply equally to Kafka version upgrades, cloud-to-cloud migrations, or any major streaming infrastructure change.