Building a Customer Data Platform: Architecture for 10M+ Users
March 10, 2026
14 min read
The Problem: Data Silos at Mortgage Scale
When I joined the data platform team at Mr. Cooper in 2018, customer data was scattered across 10+ source systems: loan origination, servicing, payment processing, CRM, collections, escrow, and more. Every team that needed a complete customer view had to join data from multiple systems — sometimes taking hours, sometimes days. Critical decisions (risk assessment, customer outreach, fraud detection) were being made on stale or incomplete data.
By 2021, we had designed and built a Customer Data Platform (CDP) that unified all of this into a single, real-time-accessible layer serving 10M+ daily users. This article documents the architectural decisions, the tradeoffs we made, and what I'd do differently.
Architecture Overview: The Medallion Approach
We adopted a medallion architecture (Bronze → Silver → Gold) built on Confluent Kafka, Databricks, and Snowflake:
- Bronze Layer (Raw Ingestion) — Every change event from source systems lands here via Kafka Connect CDC connectors (Debezium). No transformation, just raw events with metadata
- Silver Layer (Cleansed & Unified) — Databricks Structured Streaming jobs process bronze events, apply schema validation, deduplication, and entity resolution. Output to Delta Lake
- Gold Layer (Business-Ready) — Aggregated customer profiles, 360-degree views, feature sets for ML models. Stored in Snowflake for SQL access, replicated back to Kafka for real-time consumers
Source DB → Debezium CDC → Kafka (Bronze topics)
↓
Databricks Structured Streaming (Silver)
↓
Delta Lake (Silver tables) → Snowflake (Gold)
↓
Kafka (Gold events) → Real-time consumers
Decision 1: CDC Over Batch Polling
The first and most important architectural decision: how do we get data out of 10+ relational source systems?
We evaluated three options:
- Batch polling (JDBC) — Simple, no additional infrastructure, but poll intervals create 15–60 minute data lag and miss deletes entirely
- Application-level events — Developers emit events on every write. Clean but requires coordination with 20+ application teams and introduces event drift over time
- Database CDC (Debezium) — Reads the database transaction log. Captures all changes (inserts, updates, deletes) with sub-second latency, no application changes required
We chose CDC. The setup cost (deploying and operating Debezium connectors) was worth it — we captured 100% of data changes including deletes, with <2 second lag from source to Bronze.
Lesson: If you have more than 2 source systems and need <5 minute data freshness, CDC is almost always the right choice despite the operational overhead.
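As a concrete sketch, a Debezium source connector is registered by POSTing a JSON config to the Kafka Connect REST API. Everything below — the connector name, hostnames, credentials, and table list — is a placeholder, and exact property names vary by Debezium version; this shows the shape, not our production config.

```python
# Illustrative Debezium connector registration payload for the Kafka Connect
# REST API. All names, hosts, and credentials here are hypothetical.
import json

loan_servicing_connector = {
    "name": "loan-servicing-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "loan-db.internal",  # placeholder host
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "${secrets:loan-db/cdc_reader}",
        "database.dbname": "loan_servicing",
        "database.server.name": "loan_servicing",  # becomes the Bronze topic prefix
        "table.include.list": "public.loans,public.borrowers",
        "snapshot.mode": "initial",  # full snapshot first, then stream the WAL
    },
}

# POSTing this JSON to the Connect REST endpoint (e.g. http://connect:8083/connectors)
# registers the connector; shown as a payload only so the sketch stays self-contained.
payload = json.dumps(loan_servicing_connector)
```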
Decision 2: Entity Resolution Strategy
The hardest problem in any CDP is entity resolution — matching records across systems that represent the same customer but have different IDs, slightly different names, or inconsistent email addresses.
We built a three-tier matching strategy:
Tier 1: Deterministic Matching
Exact match on Social Security Number (SSN), loan account number, or email address (normalized to lowercase, trimmed). This handles ~75% of records.
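A minimal sketch of the Tier 1 key derivation, assuming the normalization rules above. One caveat worth hedging: a bare SHA-256 over SSNs is brute-forceable because the SSN space is small, so a production system would use a salted or keyed hash; the plain hash here just keeps the sketch self-contained.

```python
import hashlib

def normalize_email(email: str) -> str:
    """Normalize an email as described above: trimmed, lowercased."""
    return email.strip().lower()

def ssn_hash(ssn: str) -> str:
    """Hash the SSN (digits only) so the plain value never leaves the source layer."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    return hashlib.sha256(digits.encode("utf-8")).hexdigest()

def deterministic_keys(record: dict) -> set:
    """Exact-match keys for Tier 1: hashed SSN, loan account number, normalized email."""
    keys = set()
    if record.get("ssn"):
        keys.add(("ssn_hash", ssn_hash(record["ssn"])))
    if record.get("loan_account"):
        keys.add(("loan_account", record["loan_account"]))
    if record.get("email"):
        keys.add(("email", normalize_email(record["email"])))
    return keys
```

Two records resolve to the same customer in Tier 1 whenever their key sets intersect.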
Tier 2: Probabilistic Matching
For records without exact match keys, we used a scoring model combining:
- Name similarity (Levenshtein distance, normalized)
- Address similarity (street, city, zip)
- Phone number match
- Date of birth match
Records scoring above 0.85 were auto-merged; records scoring between 0.60 and 0.85 went to a human review queue.
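The Tier 2 scoring can be sketched roughly like this. The thresholds are the ones from the text, but the feature weights are illustrative stand-ins, not the production model's:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    """Levenshtein distance normalized to [0, 1], as described above."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Illustrative weights only; the real model's weights were tuned separately.
WEIGHTS = {"name": 0.40, "address": 0.30, "phone": 0.15, "dob": 0.15}

def match_score(r1: dict, r2: dict) -> float:
    features = {
        "name": name_similarity(r1["name"], r2["name"]),
        "address": 1.0 if r1["address"] == r2["address"] else 0.0,
        "phone": 1.0 if r1["phone"] == r2["phone"] else 0.0,
        "dob": 1.0 if r1["dob"] == r2["dob"] else 0.0,
    }
    return sum(WEIGHTS[k] * v for k, v in features.items())

def route(score: float) -> str:
    """Thresholds from the text: >0.85 auto-merge, 0.60-0.85 human review."""
    if score > 0.85:
        return "auto_merge"
    if score >= 0.60:
        return "review_queue"
    return "no_match"
```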
Tier 3: Manual Override
The ops team could manually link or unlink records. These overrides were stored as first-class events in Kafka and took highest priority in all resolution decisions.
Decision 3: The 200+ Table Problem
Unifying 200+ source tables sounds daunting. Here's how we made it tractable:
Domain-Based Partitioning
We organized tables into 8 domains: Customer Identity, Loan, Payment, Escrow, Collections, Communications, Documents, and Risk. Each domain had an owning team responsible for its Silver-layer transformations.
Canonical Schema
We defined a canonical Customer entity schema with ~120 fields. Every source system's data was mapped to this schema by the owning domain team. New fields could be added but existing fields couldn't be removed or renamed without a deprecation cycle.
-- Example: Canonical customer identity fields
customer_id VARCHAR -- CDP-generated UUID
ssn_hash VARCHAR -- SHA-256 of SSN (never plain SSN)
first_name VARCHAR
last_name VARCHAR
email VARCHAR -- Normalized
phone_primary VARCHAR -- E.164 format
address_line1 VARCHAR
city VARCHAR
state CHAR(2)
zip CHAR(5)
created_at TIMESTAMP
updated_at TIMESTAMP -- CDP-level, not source
source_system VARCHAR -- Origin system ID
source_system_id VARCHAR -- ID in origin system
Schema Governance
All schemas were versioned in Confluent Schema Registry with backward compatibility enforced. Producers couldn't publish breaking schema changes — they had to be approved through a schema review process. This saved us from countless silent data breakages.
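As a much-simplified structural stand-in for what the compatibility check enforces, the governance rules above reduce to: every existing field must survive with the same type, and new fields must declare a default so old events still deserialize. Schema Registry does considerably more than this, but the shape of the check looks like:

```python
def is_compatible(old_fields: dict, new_fields: dict) -> tuple:
    """
    Simplified governance check mirroring the rules described above.
    Each fields dict maps field name -> {"type": ..., optional "default": ...}.
    Returns (ok, list_of_violations).
    """
    violations = []
    for name, spec in old_fields.items():
        if name not in new_fields:
            violations.append(f"removed field: {name}")
        elif new_fields[name]["type"] != spec["type"]:
            violations.append(f"type change on {name}")
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            violations.append(f"new field without default: {name}")
    return (not violations, violations)
```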
Decision 4: Real-Time vs. Batch Serving
Not all use cases have the same freshness requirements:
| Use Case | Latency Requirement | Serving Layer |
|---|---|---|
| Fraud detection | <100ms | Kafka consumer + Redis cache |
| Customer service lookup | <500ms | REST API over Elasticsearch index |
| Marketing segmentation | 1–4 hours | Snowflake + scheduled refresh |
| Compliance reporting | Daily | Snowflake overnight batch |
| ML feature engineering | Hourly | Databricks + Feast feature store |
The key insight: don't build one serving layer to rule them all. Build the right serving layer for each latency tier and populate them all from the same unified Gold layer.
Operational Lessons
1. Schema Evolution Is Your Biggest Risk
Source systems change schemas without warning. We caught 40+ schema-breaking changes in the first year because Confluent Schema Registry rejected them at the connector level before they could corrupt downstream data. Invest heavily in schema governance early.
2. Monitor Consumer Lag, Not Just Throughput
Throughput metrics looked healthy during several incidents because messages were still flowing. Consumer lag was the real signal — it told us when a processing bottleneck was forming 20 minutes before it became a customer-facing problem.
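The lag signal itself is simple arithmetic over offsets. A sketch of the alerting rule this describes, with an illustrative threshold and a "lag growing on every sample" heuristic that are assumptions, not our exact alert config:

```python
def partition_lag(end_offsets: dict, committed: dict) -> dict:
    """Lag per partition = log end offset minus last committed offset."""
    return {tp: end_offsets[tp] - committed.get(tp, 0) for tp in end_offsets}

def lag_alert(history: list, threshold: int) -> bool:
    """
    Alert when total lag exceeds the threshold OR has grown on every recent
    sample, since a steadily growing lag under healthy throughput is exactly
    the early warning described above. `history` is oldest-first lag samples.
    """
    if not history:
        return False
    if history[-1] > threshold:
        return True
    growing = all(b > a for a, b in zip(history, history[1:]))
    return growing and len(history) >= 3
```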
3. Data Quality Gates Are Non-Negotiable
We added quality validation at the Silver layer: null checks, referential integrity, value range validation. Records failing validation went to a dead letter topic rather than silently corrupting the Gold layer. This dramatically increased downstream teams' trust in the platform.
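A sketch of what such a gate can look like; the specific fields and rules here are illustrative, not our exact rule set:

```python
import re

US_STATES = {"TX", "CA", "NY", "FL", "WA"}  # truncated for the example

def validate(record: dict) -> list:
    """Silver-layer quality gate: collect every violation rather than
    failing fast, so the dead-letter record explains itself."""
    errors = []
    for field in ("customer_id", "source_system", "source_system_id"):
        if not record.get(field):
            errors.append(f"null check failed: {field}")
    zip_code = record.get("zip", "")
    if zip_code and not re.fullmatch(r"\d{5}", zip_code):
        errors.append("zip must be 5 digits")
    state = record.get("state", "")
    if state and state not in US_STATES:
        errors.append(f"unknown state code: {state}")
    return errors

def route_record(record: dict):
    """Clean records continue toward Gold; failures go to the dead letter topic."""
    errors = validate(record)
    if errors:
        return ("dead_letter", {**record, "_errors": errors})
    return ("gold", record)
```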
4. Backfill Strategy Matters
When you launch a CDP, you need to populate it with historical data. We underestimated this. Our backfill took 3 weeks because we tried to run it through the streaming pipeline at full volume. In retrospect: build a dedicated batch backfill path from day one, separate from the streaming path.
5. Ownership Requires SLA Commitment
We assigned each domain team a data quality SLA: 99.9% of events processed within 5 minutes. This accountability created the right incentives for teams to own their connectors, schemas, and Silver transformations properly.
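Measuring that SLA is straightforward arithmetic over per-event processing latencies; a minimal sketch:

```python
def sla_met(latencies_seconds: list, sla_seconds: int = 300,
            target: float = 0.999) -> bool:
    """99.9% of events processed within 5 minutes, per the domain-team SLA."""
    if not latencies_seconds:
        return True
    on_time = sum(1 for s in latencies_seconds if s <= sla_seconds)
    return on_time / len(latencies_seconds) >= target
```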
Results After 4 Years
- 200+ tables unified from 10+ source systems
- 99.9% SLA on event processing timeliness
- 75% faster customer data lookups (Elasticsearch index vs. previous multi-system joins)
- $2M+ infrastructure savings by decommissioning redundant reporting ETL pipelines
- Zero data quality incidents reaching production in the last 18 months
- 10M+ daily users served across fraud, customer service, marketing, and compliance
Building a CDP is a multi-year investment. The first year is hard — schema fights, entity resolution edge cases, backfill headaches. But the compounding value as more use cases are unlocked from a single trusted data layer is transformative. Start with CDC, invest in schema governance, and resist the temptation to build one serving layer for all use cases.