Building a Customer Data Platform: Architecture for 10M+ Users
March 10, 2026
14 min read
The Problem: Data Silos at Mortgage Scale
When I joined the data platform team at Mr. Cooper in 2018, customer data was scattered across 10+ source systems: loan origination, servicing, payment processing, CRM, collections, escrow, and more. Every team that needed a complete customer view had to join data from multiple systems — sometimes taking hours, sometimes days. Critical decisions (risk assessment, customer outreach, fraud detection) were being made on stale or incomplete data.
By 2021, we had designed and built a Customer Data Platform (CDP) that unified all of this into a single, real-time-accessible layer serving 10M+ daily users. This article documents the architectural decisions, the tradeoffs we made, and what I'd do differently.
Architecture Overview: The Medallion Approach
We adopted a medallion architecture (Bronze → Silver → Gold) built on Confluent Kafka, Databricks, and Snowflake:
- Bronze Layer (Raw Ingestion) — Every change event from source systems lands here via Kafka Connect CDC connectors (Debezium). No transformation, just raw events with metadata
- Silver Layer (Cleansed & Unified) — Databricks Structured Streaming jobs process bronze events, apply schema validation, deduplication, and entity resolution. Output to Delta Lake
- Gold Layer (Business-Ready) — Aggregated customer profiles, 360-degree views, feature sets for ML models. Stored in Snowflake for SQL access, replicated back to Kafka for real-time consumers
Source DB → Debezium CDC → Kafka (Bronze topics)
↓
Databricks Structured Streaming (Silver)
↓
Delta Lake (Silver tables) → Snowflake (Gold)
↓
Kafka (Gold events) → Real-time consumers
Decision 1: CDC Over Batch Polling
The first and most important architectural decision: how do we get data out of 10+ relational source systems?
We evaluated three options:
- Batch polling (JDBC) — Simple, no additional infrastructure, but poll intervals create 15–60 minute data lag and miss deletes entirely
- Application-level events — Developers emit events on every write. Clean but requires coordination with 20+ application teams and introduces event drift over time
- Database CDC (Debezium) — Reads the database transaction log. Captures all changes (inserts, updates, deletes) with sub-second latency, no application changes required
We chose CDC. The setup cost (deploying and operating Debezium connectors) was worth it — we captured 100% of data changes including deletes, with <2 second lag from source to Bronze.
Lesson: If you have more than 2 source systems and need <5 minute data freshness, CDC is almost always the right choice despite the operational overhead.
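As a concrete sketch, a Debezium source connector is registered by POSTing a JSON config to the Kafka Connect REST API. Everything below — the connector name, hostnames, credentials, and table list — is a placeholder, and exact property names vary by Debezium version; this shows the shape, not our production config.

```python
# Illustrative Debezium connector registration payload for the Kafka Connect
# REST API. All names, hosts, and credentials here are hypothetical.
import json

loan_servicing_connector = {
    "name": "loan-servicing-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "loan-db.internal",  # placeholder host
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "${secrets:loan-db/cdc_reader}",
        "database.dbname": "loan_servicing",
        "database.server.name": "loan_servicing",  # becomes the Bronze topic prefix
        "table.include.list": "public.loans,public.borrowers",
        "snapshot.mode": "initial",  # full snapshot first, then stream the WAL
    },
}

# POSTing this JSON to the Connect REST endpoint (e.g. http://connect:8083/connectors)
# registers the connector; shown as a payload only so the sketch stays self-contained.
payload = json.dumps(loan_servicing_connector)
```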
Decision 2: Entity Resolution Strategy
The hardest problem in any CDP is entity resolution — matching records across systems that represent the same customer but have different IDs, slightly different names, or inconsistent email addresses.
We built a three-tier matching strategy:
Tier 1: Deterministic Matching
Exact match on Social Security Number (SSN), loan account number, or email address (normalized to lowercase, trimmed). This handles ~75% of records.
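A minimal sketch of the Tier 1 key derivation, assuming the normalization rules above. One caveat worth hedging: a bare SHA-256 over SSNs is brute-forceable because the SSN space is small, so a production system would use a salted or keyed hash; the plain hash here just keeps the sketch self-contained.

```python
import hashlib

def normalize_email(email: str) -> str:
    """Normalize an email as described above: trimmed, lowercased."""
    return email.strip().lower()

def ssn_hash(ssn: str) -> str:
    """Hash the SSN (digits only) so the plain value never leaves the source layer."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    return hashlib.sha256(digits.encode("utf-8")).hexdigest()

def deterministic_keys(record: dict) -> set:
    """Exact-match keys for Tier 1: hashed SSN, loan account number, normalized email."""
    keys = set()
    if record.get("ssn"):
        keys.add(("ssn_hash", ssn_hash(record["ssn"])))
    if record.get("loan_account"):
        keys.add(("loan_account", record["loan_account"]))
    if record.get("email"):
        keys.add(("email", normalize_email(record["email"])))
    return keys
```

Two records resolve to the same customer in Tier 1 whenever their key sets intersect.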
Tier 2: Probabilistic Matching
For records without exact match keys, we used a scoring model combining:
- Name similarity (Levenshtein distance, normalized)
- Address similarity (street, city, zip)
- Phone number match
- Date of birth match
Records scoring above 0.85 were auto-merged; records scoring between 0.60 and 0.85 went to a human review queue.
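The Tier 2 scoring can be sketched roughly like this. The thresholds are the ones from the text, but the feature weights are illustrative stand-ins, not the production model's:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    """Levenshtein distance normalized to [0, 1], as described above."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Illustrative weights only; the real model's weights were tuned separately.
WEIGHTS = {"name": 0.40, "address": 0.30, "phone": 0.15, "dob": 0.15}

def match_score(r1: dict, r2: dict) -> float:
    features = {
        "name": name_similarity(r1["name"], r2["name"]),
        "address": 1.0 if r1["address"] == r2["address"] else 0.0,
        "phone": 1.0 if r1["phone"] == r2["phone"] else 0.0,
        "dob": 1.0 if r1["dob"] == r2["dob"] else 0.0,
    }
    return sum(WEIGHTS[k] * v for k, v in features.items())

def route(score: float) -> str:
    """Thresholds from the text: >0.85 auto-merge, 0.60-0.85 human review."""
    if score > 0.85:
        return "auto_merge"
    if score >= 0.60:
        return "review_queue"
    return "no_match"
```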
Tier 3: Manual Override
The ops team could manually link or unlink records. These overrides were stored as first-class events in Kafka and took highest priority in all resolution decisions.
Decision 3: The 200+ Table Problem
Unifying 200+ source tables sounds daunting. Here's how we made it tractable:
Domain-Based Partitioning
We organized tables into 8 domains: Customer Identity, Loan, Payment, Escrow, Collections, Communications, Documents, and Risk. Each domain had an owning team responsible for its Silver-layer transformations.
Canonical Schema
We defined a canonical Customer entity schema with ~120 fields. Every source system's data was mapped to this schema by the owning domain team. New fields could be added but existing fields couldn't be removed or renamed without a deprecation cycle.
-- Example: Canonical customer identity fields
customer_id VARCHAR -- CDP-generated UUID
ssn_hash VARCHAR -- SHA-256 of SSN (never plain SSN)
first_name VARCHAR
last_name VARCHAR
email VARCHAR -- Normalized
phone_primary VARCHAR -- E.164 format
address_line1 VARCHAR
city VARCHAR
state CHAR(2)
zip CHAR(5)
created_at TIMESTAMP
updated_at TIMESTAMP -- CDP-level, not source
source_system VARCHAR -- Origin system ID
source_system_id VARCHAR -- ID in origin system
Schema Governance
All schemas were versioned in Confluent Schema Registry with backward compatibility enforced. Producers couldn't publish breaking schema changes — they had to be approved through a schema review process. This saved us from countless silent data breakages.
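As a much-simplified structural stand-in for what the compatibility check enforces, the governance rules above reduce to: every existing field must survive with the same type, and new fields must declare a default so old events still deserialize. Schema Registry does considerably more than this, but the shape of the check looks like:

```python
def is_compatible(old_fields: dict, new_fields: dict) -> tuple:
    """
    Simplified governance check mirroring the rules described above.
    Each fields dict maps field name -> {"type": ..., optional "default": ...}.
    Returns (ok, list_of_violations).
    """
    violations = []
    for name, spec in old_fields.items():
        if name not in new_fields:
            violations.append(f"removed field: {name}")
        elif new_fields[name]["type"] != spec["type"]:
            violations.append(f"type change on {name}")
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            violations.append(f"new field without default: {name}")
    return (not violations, violations)
```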
Decision 4: Real-Time vs. Batch Serving
Not all use cases have the same freshness requirements:
| Use Case | Latency Requirement | Serving Layer |
|---|---|---|
| Fraud detection | <100ms | Kafka consumer + Redis cache |
| Customer service lookup | <500ms | REST API over Elasticsearch index |
| Marketing segmentation | 1–4 hours | Snowflake + scheduled refresh |
| Compliance reporting | Daily | Snowflake overnight batch |
| ML feature engineering | Hourly | Databricks + Feast feature store |
The key insight: don't build one serving layer to rule them all. Build the right serving layer for each latency tier and populate them all from the same unified Gold layer.
Operational Lessons
1. Schema Evolution Is Your Biggest Risk
Source systems change schemas without warning. We caught 40+ schema-breaking changes in the first year because Confluent Schema Registry rejected them at the connector level before they could corrupt downstream data. Invest heavily in schema governance early.
2. Monitor Consumer Lag, Not Just Throughput
Throughput metrics looked healthy during several incidents because messages were still flowing. Consumer lag was the real signal — it told us when a processing bottleneck was forming 20 minutes before it became a customer-facing problem.
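The lag signal itself is simple arithmetic over offsets. A sketch of the alerting rule this describes, with an illustrative threshold and a "lag growing on every sample" heuristic that are assumptions, not our exact alert config:

```python
def partition_lag(end_offsets: dict, committed: dict) -> dict:
    """Lag per partition = log end offset minus last committed offset."""
    return {tp: end_offsets[tp] - committed.get(tp, 0) for tp in end_offsets}

def lag_alert(history: list, threshold: int) -> bool:
    """
    Alert when total lag exceeds the threshold OR has grown on every recent
    sample, since a steadily growing lag under healthy throughput is exactly
    the early warning described above. `history` is oldest-first lag samples.
    """
    if not history:
        return False
    if history[-1] > threshold:
        return True
    growing = all(b > a for a, b in zip(history, history[1:]))
    return growing and len(history) >= 3
```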
3. Data Quality Gates Are Non-Negotiable
We added quality validation at the Silver layer: null checks, referential integrity, value range validation. Records failing validation went to a dead letter topic rather than silently corrupting the Gold layer. This dramatically increased downstream teams' trust in the platform.
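A sketch of what such a gate can look like; the specific fields and rules here are illustrative, not our exact rule set:

```python
import re

US_STATES = {"TX", "CA", "NY", "FL", "WA"}  # truncated for the example

def validate(record: dict) -> list:
    """Silver-layer quality gate: collect every violation rather than
    failing fast, so the dead-letter record explains itself."""
    errors = []
    for field in ("customer_id", "source_system", "source_system_id"):
        if not record.get(field):
            errors.append(f"null check failed: {field}")
    zip_code = record.get("zip", "")
    if zip_code and not re.fullmatch(r"\d{5}", zip_code):
        errors.append("zip must be 5 digits")
    state = record.get("state", "")
    if state and state not in US_STATES:
        errors.append(f"unknown state code: {state}")
    return errors

def route_record(record: dict):
    """Clean records continue toward Gold; failures go to the dead letter topic."""
    errors = validate(record)
    if errors:
        return ("dead_letter", {**record, "_errors": errors})
    return ("gold", record)
```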
4. Backfill Strategy Matters
When you launch a CDP, you need to populate it with historical data. We underestimated this. Our backfill took 3 weeks because we tried to run it through the streaming pipeline at full volume. In retrospect: build a dedicated batch backfill path from day one, separate from the streaming path.
5. Ownership Requires SLA Commitment
We assigned each domain team a data quality SLA: 99.9% of events processed within 5 minutes. This accountability created the right incentives for teams to own their connectors, schemas, and Silver transformations properly.
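Measuring that SLA is straightforward arithmetic over per-event processing latencies; a minimal sketch:

```python
def sla_met(latencies_seconds: list, sla_seconds: int = 300,
            target: float = 0.999) -> bool:
    """99.9% of events processed within 5 minutes, per the domain-team SLA."""
    if not latencies_seconds:
        return True
    on_time = sum(1 for s in latencies_seconds if s <= sla_seconds)
    return on_time / len(latencies_seconds) >= target
```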
Results After 4 Years
- 200+ tables unified from 10+ source systems
- 99.9% SLA on event processing timeliness
- 75% faster customer data lookups (Elasticsearch index vs. previous multi-system joins)
- $2M+ infrastructure savings by decommissioning redundant reporting ETL pipelines
- Zero data quality incidents reaching production in the last 18 months
- 10M+ daily users served across fraud, customer service, marketing, and compliance
Building a CDP is a multi-year investment. The first year is hard — schema fights, entity resolution edge cases, backfill headaches. But the compounding value as more use cases are unlocked from a single trusted data layer is transformative. Start with CDC, invest in schema governance, and resist the temptation to build one serving layer for all use cases.