Building a Customer Data Platform: Architecture for 10M+ Users

March 10, 2026

14 min read

Data Engineering
Customer Data Platform
Kafka
Snowflake
Databricks
Architecture

The Problem: Data Silos at Mortgage Scale

When I joined the data platform team at Mr. Cooper in 2018, customer data was scattered across 10+ source systems: loan origination, servicing, payment processing, CRM, collections, escrow, and more. Every team that needed a complete customer view had to join data from multiple systems — sometimes taking hours, sometimes days. Critical decisions (risk assessment, customer outreach, fraud detection) were being made on stale or incomplete data.

By 2021, we had designed and built a Customer Data Platform (CDP) that unified all of this into a single, real-time accessible layer serving 10M+ daily users. This article documents the architectural decisions, the tradeoffs we made, and what I'd do differently.

Architecture Overview: The Medallion Approach

We adopted a medallion architecture (Bronze → Silver → Gold) built on Confluent Kafka, Databricks, and Snowflake:

  • Bronze Layer (Raw Ingestion) — Every change event from source systems lands here via Kafka Connect CDC connectors (Debezium). No transformation, just raw events with metadata
  • Silver Layer (Cleansed & Unified) — Databricks Structured Streaming jobs process bronze events, apply schema validation, deduplication, and entity resolution. Output to Delta Lake
  • Gold Layer (Business-Ready) — Aggregated customer profiles, 360-degree views, feature sets for ML models. Stored in Snowflake for SQL access, replicated back to Kafka for real-time consumers
Source DB → Debezium CDC → Kafka (Bronze topics)
                                  ↓
              Databricks Structured Streaming (Silver)
                                  ↓  
              Delta Lake (Silver tables) → Snowflake (Gold)
                                  ↓
              Kafka (Gold events) → Real-time consumers
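
To ground the diagram, here is a minimal sketch of a Bronze-to-Silver Structured Streaming job in the style described above. The broker, topic, event schema, and paths are illustrative placeholders rather than our actual configuration, and the real jobs also ran entity resolution and fuller validation.

# Example: Bronze (Kafka) -> Silver (Delta) Structured Streaming job
# (broker, topic, schema, and paths are illustrative placeholders;
#  `spark` is the SparkSession provided by the Databricks runtime)
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical payload schema for a customer-identity CDC event
event_schema = StructType([
    StructField("source_system", StringType()),
    StructField("source_system_id", StringType()),
    StructField("email", StringType()),
    StructField("updated_at", TimestampType()),
])

bronze = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "bronze.customer_identity")
    .option("startingOffsets", "latest")
    .load())

silver = (bronze
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("email", F.lower(F.trim(F.col("email"))))  # normalization
    .dropDuplicates(["source_system", "source_system_id", "updated_at"]))  # dedup; a watermark would bound state in production

(silver.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/cdp/checkpoints/customer_identity")
    .outputMode("append")
    .start("/mnt/cdp/silver/customer_identity"))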

Decision 1: CDC Over Batch Polling

The first and most important architectural decision: how do we get data out of 10+ relational source systems?

We evaluated three options:

  1. Batch polling (JDBC) — Simple, no additional infrastructure, but poll intervals create 15–60 minute data lag and miss deletes entirely
  2. Application-level events — Developers emit events on every write. Clean but requires coordination with 20+ application teams and introduces event drift over time
  3. Database CDC (Debezium) — Reads the database transaction log. Captures all changes (inserts, updates, deletes) with sub-second latency, no application changes required

We chose CDC. The setup cost (deploying and operating Debezium connectors) was worth it — we captured 100% of data changes including deletes, with <2 second lag from source to Bronze.
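
To make that setup cost concrete: each source database got a Debezium connector registered with Kafka Connect through its REST API. A rough sketch of what a registration looks like follows; the SQL Server connector class, hostnames, credentials, and table list are placeholders (the article doesn't specify the source database engines), and exact property names vary by Debezium version.

# Example: registering a Debezium source connector via the Kafka Connect REST API
# (connector class, hosts, credentials, and tables are placeholders)
import requests

connector = {
    "name": "loan-servicing-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "servicing-db.internal",
        "database.port": "1433",
        "database.user": "cdc_reader",
        "database.password": "change-me",            # injected from a secret store in practice
        "database.dbname": "servicing",
        "database.server.name": "servicing",          # becomes the topic prefix
        "table.include.list": "dbo.customer,dbo.loan",
        "database.history.kafka.bootstrap.servers": "broker:9092",
        "database.history.kafka.topic": "schema-changes.servicing",
    },
}

resp = requests.post("http://connect:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()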

Lesson: If you have more than 2 source systems and need <5 minute data freshness, CDC is almost always the right choice despite the operational overhead.

Decision 2: Entity Resolution Strategy

The hardest problem in any CDP is entity resolution — matching records across systems that represent the same customer but have different IDs, slightly different names, or inconsistent email addresses.

We built a three-tier matching strategy:

Tier 1: Deterministic Matching

Exact match on Social Security Number (SSN), loan account number, or email address (normalized to lowercase, trimmed). This handled ~75% of records.
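
A deterministic match is only as good as its key normalization. A minimal sketch of that step is below; the priority order and rules are illustrative, and in practice the SSN comparison happens on the hashed value described in the canonical schema, never the raw number.

# Example: key normalization for deterministic matching (illustrative rules)
import re

def normalize_email(email):
    return email.strip().lower() if email else None

def normalize_ssn(ssn):
    if not ssn:
        return None
    digits = re.sub(r"\D", "", ssn)          # drop dashes and spaces
    return digits if len(digits) == 9 else None

def deterministic_key(record):
    """Return the first available exact-match key, in priority order."""
    ssn = normalize_ssn(record.get("ssn"))
    if ssn:
        return "ssn:" + ssn                   # compared as ssn_hash downstream
    loan = (record.get("loan_account_number") or "").strip()
    if loan:
        return "loan:" + loan
    email = normalize_email(record.get("email"))
    if email:
        return "email:" + email
    return None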

Tier 2: Probabilistic Matching

For records without exact match keys, we used a scoring model combining:

  • Name similarity (Levenshtein distance, normalized)
  • Address similarity (street, city, zip)
  • Phone number match
  • Date of birth match

Records scoring above 0.85 were auto-merged. Records between 0.60–0.85 went to a human review queue.
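
A simplified sketch of the scoring logic is below. The weights, field names, and the reduction of address similarity to exact equality are illustrative assumptions; only the 0.85 and 0.60 thresholds come from the description above, and the production model was tuned on labeled pairs.

# Example: probabilistic match scoring (illustrative weights and fields)
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def name_similarity(a, b):
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def match_score(r1, r2):
    score = 0.0
    score += 0.15 * name_similarity(r1["first_name"], r2["first_name"])
    score += 0.20 * name_similarity(r1["last_name"], r2["last_name"])
    # Address comparison simplified here to exact equality on line1 + zip
    score += 0.25 * float(r1["address_line1"] == r2["address_line1"] and r1["zip"] == r2["zip"])
    score += 0.20 * float(r1["phone_primary"] == r2["phone_primary"])
    score += 0.20 * float(r1["date_of_birth"] == r2["date_of_birth"])
    return score

def resolve(r1, r2):
    s = match_score(r1, r2)
    if s > 0.85:
        return "auto_merge"
    if s >= 0.60:
        return "human_review"
    return "no_match"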

Tier 3: Manual Override

Ops team could manually link or unlink records. These overrides were stored as first-class events in Kafka and took highest priority in all resolution decisions.
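
For illustration, a manual override published as a first-class Kafka event might look like the sketch below. The topic name and event shape are assumptions, not the actual contract; keying by customer_id keeps all overrides for a customer on one partition so they replay in order.

# Example: publishing a manual link/unlink override as a first-class Kafka event
# (topic name and event shape are assumptions)
import json, time, uuid
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker:9092"})  # placeholder

override_event = {
    "event_id": str(uuid.uuid4()),
    "event_type": "MANUAL_LINK",            # or MANUAL_UNLINK
    "customer_id": "c-123",                 # hypothetical CDP customer UUID
    "source_system": "crm",
    "source_system_id": "crm-98765",
    "operator": "ops-user@example.com",
    "reason": "Verified same customer via call",
    "created_at": int(time.time() * 1000),
}

producer.produce(
    "cdp.identity.overrides",               # placeholder topic
    key=override_event["customer_id"],
    value=json.dumps(override_event),
)
producer.flush()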

Decision 3: The 200+ Table Problem

Unifying 200+ source tables sounds daunting. Here's how we made it tractable:

Domain-Based Partitioning

We organized tables into 8 domains: Customer Identity, Loan, Payment, Escrow, Collections, Communications, Documents, and Risk. Each domain had an owning team responsible for its Silver-layer transformations.

Canonical Schema

We defined a canonical Customer entity schema with ~120 fields. Every source system's data was mapped to this schema by the owning domain team. New fields could be added but existing fields couldn't be removed or renamed without a deprecation cycle.

-- Example: Canonical customer identity fields
customer_id          VARCHAR  -- CDP-generated UUID
ssn_hash             VARCHAR  -- SHA-256 of SSN (never plain SSN)
first_name           VARCHAR
last_name            VARCHAR
email                VARCHAR  -- Normalized
phone_primary        VARCHAR  -- E.164 format
address_line1        VARCHAR
city                 VARCHAR
state                CHAR(2)
zip                  CHAR(5)
created_at           TIMESTAMP
updated_at           TIMESTAMP  -- CDP-level, not source
source_system        VARCHAR  -- Origin system ID
source_system_id     VARCHAR  -- ID in origin system
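
One note on the ssn_hash field: the raw SSN never enters the canonical layer. A minimal sketch of the hashing step is below; the schema comment above only specifies SHA-256, and the keyed HMAC variant is shown as an assumption about how you might harden it, since the SSN space is small enough to brute-force a plain hash.

# Example: hashing the SSN before it enters the canonical schema
import hashlib
import hmac
import re

def normalize_ssn(ssn):
    return re.sub(r"\D", "", ssn)  # digits only

def ssn_sha256(ssn):
    # Plain SHA-256, as described in the schema comment above
    return hashlib.sha256(normalize_ssn(ssn).encode("utf-8")).hexdigest()

def ssn_hmac(ssn, key):
    # Keyed variant (assumption): resists brute-forcing the ~1e9 possible SSNs
    return hmac.new(key, normalize_ssn(ssn).encode("utf-8"), hashlib.sha256).hexdigest()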

Schema Governance

All schemas were versioned in Confluent Schema Registry with backward compatibility enforced. Producers couldn't publish breaking schema changes — they had to be approved through a schema review process. This saved us from countless silent data breakages.
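
As an illustration of what that enforcement looks like, Schema Registry lets you pin a subject's compatibility mode and test a candidate schema against the latest registered version before deploying. The registry URL, subject name, and sample schema below are placeholders.

# Example: enforcing backward compatibility in Confluent Schema Registry
# (registry URL, subject, and sample schema are placeholders)
import json
import requests

REGISTRY = "http://schema-registry:8081"
SUBJECT = "silver.customer_identity-value"
HEADERS = {"Content-Type": "application/vnd.schemaregistry.v1+json"}

# 1. Pin the subject's compatibility mode to BACKWARD
requests.put(f"{REGISTRY}/config/{SUBJECT}",
             headers=HEADERS,
             data=json.dumps({"compatibility": "BACKWARD"})).raise_for_status()

# 2. Test a proposed schema against the latest registered version before it ships
proposed_schema = {
    "type": "record",
    "name": "CustomerIdentity",
    "fields": [
        {"name": "customer_id", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}
resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers=HEADERS,
    data=json.dumps({"schema": json.dumps(proposed_schema)}),
)
resp.raise_for_status()
if not resp.json()["is_compatible"]:
    raise RuntimeError("Breaking schema change: rejected by compatibility check")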

Decision 4: Real-Time vs. Batch Serving

Not all use cases have the same freshness requirements:

Use Case                   Latency Requirement   Serving Layer
Fraud detection            <100ms                Kafka consumer + Redis cache
Customer service lookup    <500ms                REST API over Elasticsearch index
Marketing segmentation     1–4 hours             Snowflake + scheduled refresh
Compliance reporting       Daily                 Snowflake overnight batch
ML feature engineering     Hourly                Databricks + Feast feature store

The key insight: don't build one serving layer to rule them all. Build the right serving layer for each latency tier and populate them all from the same unified Gold layer.

Operational Lessons

1. Schema Evolution Is Your Biggest Risk

Source systems change schemas without warning. We caught 40+ schema-breaking changes in the first year because Confluent Schema Registry rejected them at the connector level before they could corrupt downstream data. Invest heavily in schema governance early.

2. Monitor Consumer Lag, Not Just Throughput

Throughput metrics looked healthy during several incidents because messages were still flowing. Consumer lag was the real signal — it told us when a processing bottleneck was forming 20 minutes before it became a customer-facing problem.
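
A minimal sketch of the lag calculation, end offset minus committed offset per partition, is below. The group id, topic, and broker are placeholders, and in practice this logic fed our metrics and alerting pipeline rather than a one-off script.

# Example: computing consumer lag = end offset - committed offset per partition
# (group id, topic, and broker are placeholders)
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "silver-customer-identity",   # the group whose lag we measure
    "enable.auto.commit": False,
})

topic = "bronze.customer_identity"
metadata = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]

for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # If nothing has been committed yet, treat the whole retained range as lag
    lag = high - tp.offset if tp.offset >= 0 else high - low
    print(f"partition={tp.partition} committed={tp.offset} end={high} lag={lag}")

consumer.close()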

3. Data Quality Gates Are Non-Negotiable

We added quality validation at the Silver layer: null checks, referential integrity, value range validation. Records failing validation went to a dead letter topic rather than silently corrupting the Gold layer. This increased trust across all downstream teams dramatically.
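
In a Structured Streaming job, that gate can be sketched as a split between valid records and the dead letter topic, as below. The rules, column names, topic, and paths are illustrative, and silver_stream stands in for the parsed streaming DataFrame from the Bronze-to-Silver job sketched earlier.

# Example: Silver-layer quality gate routing failures to a dead letter topic
# (rules, topic, and paths are illustrative)
from pyspark.sql import functions as F

def apply_quality_gate(batch_df, batch_id):
    checked = batch_df.withColumn(
        "dq_failure",
        F.when(F.col("customer_id").isNull(), "null_customer_id")
         .when(~F.col("state").rlike("^[A-Z]{2}$"), "bad_state_code")
         .when(~F.col("zip").rlike("^[0-9]{5}$"), "bad_zip")
         .otherwise(F.lit(None)))
    # Valid records continue to the Silver Delta table
    (checked.filter(F.col("dq_failure").isNull()).drop("dq_failure")
        .write.format("delta").mode("append")
        .save("/mnt/cdp/silver/customer_identity"))
    # Failing records go to a dead letter topic instead of corrupting Gold
    (checked.filter(F.col("dq_failure").isNotNull())
        .selectExpr("to_json(struct(*)) AS value")
        .write.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("topic", "cdp.dead_letter.customer_identity")
        .save())

(silver_stream.writeStream
    .foreachBatch(apply_quality_gate)
    .option("checkpointLocation", "/mnt/cdp/checkpoints/dq_customer_identity")
    .start())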

4. Backfill Strategy Matters

When you launch a CDP, you need to populate it with historical data. We underestimated this. Our backfill took 3 weeks because we tried to run it through the streaming pipeline at full volume. In retrospect: build a dedicated batch backfill path from day one, separate from the streaming path.

5. Ownership Requires SLA Commitment

We assigned each domain team a data quality SLA: 99.9% of events processed within 5 minutes. This accountability created the right incentives for teams to own their connectors, schemas, and Silver transformations properly.

Results After 4 Years

  • 200+ tables unified from 10+ source systems
  • 99.9% SLA on event processing timeliness
  • 75% faster customer data lookups (Elasticsearch index vs. previous multi-system joins)
  • $2M+ infrastructure savings by decommissioning redundant reporting ETL pipelines
  • Zero data quality incidents reaching production in the last 18 months
  • 10M+ daily users served across fraud, customer service, marketing, and compliance

Building a CDP is a multi-year investment. The first year is hard — schema fights, entity resolution edge cases, backfill headaches. But the compounding value as more use cases are unlocked from a single trusted data layer is transformative. Start with CDC, invest in schema governance, and resist the temptation to build one serving layer for all use cases.