Event-Driven Microservices with Kafka: 5 Patterns That Actually Scale

May 5, 2026

16 min read

Kafka
Microservices
Event-Driven Architecture
Distributed Systems
Platform Engineering

Why Microservices Need Events

The standard explanation for event-driven architecture is that it decouples services. That is true but incomplete. The more precise benefit is that it decouples services in the right way — at the data boundary rather than at the network boundary. When services communicate by publishing events to Kafka topics instead of calling each other's APIs, you gain something more valuable than loose coupling: you gain independent deployability, replay capability, and the ability to add new consumers without touching the producer.

At Mr. Cooper, building a mortgage servicing platform for 10M+ customers, event-driven architecture was not an architectural preference — it was a practical necessity. With 10+ source systems producing data and 15+ downstream consumers needing to react to business events, point-to-point service calls would have created a dependency graph that was impossible to maintain. Kafka topics became the canonical event record for the entire platform.

But event-driven architecture is not a single pattern. It is a family of patterns that solve different problems. Using the wrong one for a given use case leads to systems that are harder to reason about, not easier. These are the five patterns I have found most valuable — and most commonly misapplied.


Pattern 1: Event Notification

Event Notification is the simplest pattern. A service publishes a minimal event to signal that something happened. The event contains just enough information for consumers to know they need to act — typically an event type and an entity ID. Consumers that care about the event fetch the current state from the producing service's API.

{
  "eventType": "customer.profile.updated",
  "customerId": "cust-98765",
  "occurredAt": "2026-05-05T10:30:00Z"
}

When to use it: When the event data is large or frequently changing and you do not want to replicate state in every consumer's topic backlog. When consumers need the current state at processing time rather than the state at event time. When the producing service owns the canonical record and consumers should always read from it.

When not to use it: When consumers cannot afford the latency of a synchronous API call after receiving the event. When the producing service may be unavailable at the time the consumer processes the event (temporal coupling). When you need to replay historical state — replaying events does not replay the state at the time each event occurred.

The hidden coupling problem: Event Notification is often mistaken for a fully decoupled pattern. It is not. Consumers are still coupled to the producer's API — they just move the coupling from compile time to runtime. If the producer API is unavailable when the consumer needs to fetch state, the consumer is blocked. For highly available consumers, prefer Event-Carried State Transfer.
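
To make that coupling concrete, here is a minimal sketch of a notification consumer, assuming Spring Kafka; the ProfileClient wrapper, its fetchProfile call, and the helper methods are hypothetical names used for illustration, not a real library API:

// Hedged sketch: processing a notification event requires a synchronous
// call back to the producer's API, which is where the runtime coupling lives.
@KafkaListener(topics = "customer.profile.updated")
public void onProfileUpdated(ConsumerRecord<String, String> record) {
  String customerId = extractCustomerId(record.value()); // parse the minimal payload
  // If the producer's API is down, this consumer is blocked right here.
  CustomerProfile current = profileClient.fetchProfile(customerId);
  updateLocalView(current);
}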

Pattern 2: Event-Carried State Transfer

Event-Carried State Transfer (a term coined by Martin Fowler) embeds the relevant state directly in the event payload. Consumers do not need to call back to the producer to process the event — everything they need is in the message itself.

{
  "eventType": "payment.processed",
  "paymentId": "pay-44321",
  "customerId": "cust-98765",
  "amount": 1250.00,
  "currency": "USD",
  "status": "COMPLETED",
  "processedAt": "2026-05-05T10:30:00Z",
  "loanId": "loan-11234",
  "remainingBalance": 187500.00
}

When to use it: When consumers need to operate independently of the producer's availability. When event replay must reproduce the exact state at the time each event occurred (not current state). When multiple consumers need the same data and you want to avoid N×M API calls across services. This is the pattern most Kafka-native architectures converge on.

When not to use it: When the state in the event is a large document (hundreds of KB or more per message) — Kafka messages are not designed for large payloads. Use the Claim Check pattern: store the large payload in object storage (GCS, S3) and put the reference URL in the event. When the event payload exposes sensitive fields that should not be distributed to all consumers — events are replicated to every consumer group, so every subscriber sees every field.
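
As a rough sketch of the Claim Check pattern mentioned above, assuming the google-cloud-storage client and a String-serialized producer; the bucket name, object naming scheme, topic, and event shape are all illustrative assumptions:

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClaimCheckPublisher {
  private final Storage storage = StorageOptions.getDefaultInstance().getService();
  private final KafkaProducer<String, String> producer; // String-serialized for brevity

  public ClaimCheckPublisher(KafkaProducer<String, String> producer) {
    this.producer = producer;
  }

  public void publishLargeDocument(String customerId, byte[] document) {
    // 1. Store the large payload in object storage.
    String objectName = "documents/" + customerId + "/" + java.util.UUID.randomUUID();
    BlobInfo blob = BlobInfo.newBuilder(BlobId.of("mortgage-docs", objectName)).build();
    storage.create(blob, document);

    // 2. Publish only the storage reference in the Kafka event.
    String event = String.format(
        "{\"eventType\":\"document.stored\",\"customerId\":\"%s\","
        + "\"payloadRef\":\"gs://mortgage-docs/%s\"}", customerId, objectName);
    producer.send(new ProducerRecord<>("document.stored", customerId, event));
  }
}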

Schema evolution is critical here. Because events are persisted in the Kafka log and replayed by consumers, the event schema is effectively a public API contract. Add fields with defaults. Never remove or rename required fields without a versioning strategy. Use Confluent Schema Registry with backward-compatible Avro or Protobuf schemas to enforce this contract at the producer.
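
On the producer side, the wiring looks roughly like the following, assuming the Confluent Avro serializer; the broker and registry URLs are placeholders, and the compatibility mode itself (e.g., BACKWARD) is configured on the Schema Registry subject, not in producer config:

import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

// Hedged sketch: a producer configured for Avro with Schema Registry.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put("schema.registry.url", "https://schema-registry.internal:8081");
// Register schemas through CI/CD review rather than from running producers,
// so incompatible changes are caught before deployment.
props.put("auto.register.schemas", false);
KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props);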

Pattern 3: Transactional Outbox

The Transactional Outbox pattern solves the dual-write problem: the scenario where a service needs to update its database and publish a Kafka event as a single atomic operation. Without this pattern, you can write to the database but fail to publish the event (or vice versa), leaving your system in an inconsistent state.

The pattern works as follows:

  1. The service writes its business data update and an outbox event record in the same database transaction. The outbox table holds the pending event: topic, key, payload, and timestamp.
  2. A separate outbox relay process reads from the outbox table and publishes events to Kafka. This can be a polling relay (a scheduled job that reads pending, not-yet-published outbox rows) or a CDC relay (a Debezium connector tailing the database transaction log to capture outbox table changes).
  3. Once an event is confirmed published, the outbox record is marked as processed or deleted.

-- In the service's database transaction
BEGIN;
  INSERT INTO payments (id, amount, status) VALUES ('pay-44321', 1250.00, 'COMPLETED');
  INSERT INTO outbox_events (id, topic, key, payload, status, created_at)
    VALUES (gen_random_uuid(), 'payment.processed', 'cust-98765',
            '{"paymentId":"pay-44321",...}', 'PENDING', NOW());
COMMIT;

-- Debezium CDC connector tails the outbox table
-- and publishes to Kafka automatically

When to use it: Whenever a service must update its own database and notify downstream consumers reliably. This is the correct solution to the dual-write problem — the main alternative, distributed two-phase commit, introduces coordination overhead that destroys throughput. At Mr. Cooper, every payment event, loan status change, and customer data update that needed guaranteed downstream notification used the Transactional Outbox via Debezium.

The CDC relay advantage: Using Debezium to tail the transaction log (rather than a polling relay) delivers near-zero latency between the database write and the Kafka event. There is no polling interval. The relay detects the outbox insert within milliseconds of the database commit and publishes to Kafka. The polling relay is simpler to implement but adds latency equal to the poll interval (typically 1–15 seconds).
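
For comparison, a polling relay can be sketched in a few lines, assuming Spring's JdbcTemplate and KafkaTemplate and the outbox table from the SQL above; the batch size and poll interval are illustrative:

// Hedged sketch of a polling outbox relay. The fixedDelay is the added
// latency; Debezium removes it by tailing the transaction log instead.
@Scheduled(fixedDelay = 5000)
public void relayPendingEvents() throws Exception {
  List<Map<String, Object>> rows = jdbcTemplate.queryForList(
      "SELECT id, topic, key, payload FROM outbox_events "
      + "WHERE status = 'PENDING' ORDER BY created_at LIMIT 100");
  for (Map<String, Object> row : rows) {
    // Block for the broker ack before marking the row published, so a send
    // failure leaves the row PENDING and the relay retries (at-least-once).
    kafkaTemplate.send((String) row.get("topic"), (String) row.get("key"),
                       (String) row.get("payload")).get();
    jdbcTemplate.update(
        "UPDATE outbox_events SET status = 'PUBLISHED' WHERE id = ?", row.get("id"));
  }
}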


Pattern 4: CQRS with Kafka as the Synchronization Bus

Command Query Responsibility Segregation (CQRS) separates the write model (commands that mutate state) from the read model (queries that serve data). Kafka serves as the synchronization bus that propagates write-side state changes to read-side materialized views, enabling each side to be independently scaled and optimized.

In practice for a mortgage platform: the loan servicing service (write side) owns the authoritative loan state and publishes loan.status.changed events to Kafka. A separate read-side service consumes these events and maintains a denormalized, query-optimized materialized view in a different data store (Elasticsearch for full-text search, Redis for low-latency key lookups, Snowflake for analytical queries). Each read store is optimized for its query pattern without affecting the write-side database.
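
The read-side consumer is deliberately simple. A sketch for the Redis case, assuming the Jedis client, events keyed by loan ID, and a denormalized JSON value already prepared by the write side:

// Hedged sketch: materialize a query-optimized view from the event stream.
// Within a loan's partition, events arrive in order, so last write wins.
@KafkaListener(topics = "loan.status.changed")
public void materialize(ConsumerRecord<String, String> record) {
  jedis.set("loan:view:" + record.key(), record.value());
}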

When to use it: When read and write load profiles differ dramatically (heavy analytical reads, transactional writes). When different consumers need the same data in different shapes (one needs it sorted by customer, another by date, another in full-text searchable form). When you need to scale read capacity independently of write capacity. When a single source of truth (the Kafka topic) needs to power multiple specialized query interfaces.

The eventual consistency trade-off. CQRS with Kafka introduces eventual consistency between the write model and the read model. After a command is processed, the corresponding event is published to Kafka, consumed by the read service, and written to the read store. This pipeline adds latency — typically seconds in a well-tuned setup. If a user performs a write and immediately reads back the same data, they may see stale state. This is usually acceptable for analytical or reporting workloads; it requires explicit handling (optimistic UI updates, read-your-writes guarantees from the write-side API) for real-time user interfaces.

Rebuilding read models from the topic. One of Kafka's most valuable properties for CQRS is log retention. If a read-side data store needs to be rebuilt — due to a bug in the consumer, a schema migration, or deployment of a new query store type — you can replay the entire topic from offset 0 to reconstruct the materialized view without touching the write-side database or any other system. This makes the read model disposable and the write model the single source of truth.
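
A rebuild can be as simple as running a throwaway consumer group from the earliest offset. A sketch with the plain Java client, where rebuildStore is a hypothetical handle to the new read store and the empty-poll exit check is a simplification:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Hedged sketch: a fresh group id + auto.offset.reset=earliest starts the
// rebuild at offset 0 without disturbing the live consumer group.
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "loan-view-rebuild-" + System.currentTimeMillis());
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
  consumer.subscribe(List.of("loan.status.changed"));
  while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(2));
    if (records.isEmpty()) break; // naive end-of-replay check, fine for a sketch
    records.forEach(r -> rebuildStore.upsert(r.key(), r.value()));
  }
}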

Pattern 5: Dead Letter Stream

The Dead Letter Stream pattern handles consumer processing failures without data loss or pipeline stalls. When a consumer cannot process a message — due to a deserialization error, validation failure, downstream service unavailability, or business rule violation — the message is published to a dead-letter topic instead of being retried indefinitely or discarded.

A robust dead letter implementation has three components:

  1. Error capture. The consumer catches processing exceptions and publishes the failed message to a dead-letter topic with enriched metadata: original topic, original partition, original offset, error type, error message, stack trace, timestamp. This context is essential for diagnosing and reprocessing failures.
  2. Monitoring and alerting. Consumer lag on the dead-letter topic triggers alerts. Any growth in dead-letter lag indicates a systematic processing failure that needs attention. Alert on dead-letter topic lag above 0 in production.
  3. Reprocessing workflow. A separate service reads from the dead-letter topic and replays messages to the original topic (or a corrected processing path) after the underlying issue is resolved. This should be a manual trigger for non-systematic failures, and an automated workflow for transient failures (e.g., temporary downstream unavailability). A sketch of a replayer follows the error handler below.

// Consumer error handler — publish to DLQ instead of stopping.
// Note: deserialization failures happen before this method runs and need
// their own handling (e.g., Spring Kafka's ErrorHandlingDeserializer).
@KafkaListener(topics = "payment.processed")
public void handlePayment(ConsumerRecord<String, String> record) {
  try {
    processPayment(record.value());
  } catch (ProcessingException e) {
    // Enrich with provenance so the failure can be diagnosed and replayed;
    // the stack trace can be attached the same way if the record type carries it.
    DeadLetterRecord dlr = DeadLetterRecord.builder()
        .originalTopic(record.topic())
        .originalPartition(record.partition())
        .originalOffset(record.offset())
        .payload(record.value())
        .errorType(e.getClass().getSimpleName())
        .errorMessage(e.getMessage())
        .failedAt(Instant.now())
        .build();
    deadLetterProducer.send("payment.processed.dlq", dlr);
  }
}
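
The reprocessing side can be a small, separately deployed listener. A sketch reusing the hypothetical DeadLetterRecord type from above (and assuming a deserializer configured for it), where the manual trigger is simply whether this consumer is running:

// Hedged sketch: replay dead-letter records to the original topic once the
// underlying defect is fixed; otherwise they round-trip straight back to the DLQ.
@KafkaListener(topics = "payment.processed.dlq", groupId = "dlq-replayer")
public void replay(DeadLetterRecord dlr) {
  kafkaTemplate.send(dlr.getOriginalTopic(), dlr.getPayload());
}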

When to use it: Always. Every consumer in a production Kafka system should have a dead-letter strategy. The question is not whether to use it but what your reprocessing workflow looks like. Consumers without DLQ handling either stall (if the consumer is configured to stop on errors) or silently drop messages (if errors are swallowed). Both outcomes are worse than routing to a dead-letter topic.

Confluent Cloud DLQ integration. Confluent Cloud managed connectors and the Snowflake Kafka connector v4 with Snowpipe Streaming support dead-letter queues natively via the errors.deadletterqueue.topic.name connector configuration. For custom consumer code, implement DLQ handling explicitly in your error handling path.

How We Combined These Patterns at Mr. Cooper

On the Customer Data Platform serving 10M+ users, the five patterns worked together as a coherent system:

  • Transactional Outbox ensured every loan status change, payment, and customer update reliably published to Kafka from the source-of-record database without dual-write risk. Debezium tailed the outbox tables for each source system.
  • Event-Carried State Transfer carried full customer and loan state on each event, enabling downstream consumers to operate without calling back to source systems. This eliminated the upstream coupling that had previously caused cascade failures.
  • CQRS powered the unified customer view: the write-side remained distributed across source systems; a dedicated consumer service materialized a unified customer record into a query-optimized read store from which all downstream analytics and customer-facing features read.
  • Dead Letter Streams on every consumer prevented any single bad message from stalling a pipeline. We processed millions of events daily; schema surprises and validation edge cases were inevitable. The DLQ meant they were monitored exceptions, not incidents.
  • Event Notification was used selectively for high-frequency, low-urgency signals (document upload confirmations, preference updates) where the consumer always needed current state rather than point-in-time state.

Pitfalls That Cost Teams Months

Publishing events before the database commit completes. If a service publishes to Kafka before or outside of its database transaction, a transaction rollback will leave a published event with no corresponding database state. Downstream consumers will process an event for data that does not exist. The Transactional Outbox pattern is the correct fix — never publish directly to Kafka inside an application-level transaction without outbox semantics.

Using event time for ordering in multi-partition topics. Message timestamps in Kafka reflect when the message was produced, not when the underlying business event occurred. Network jitter, clock skew between producers, and out-of-order producer retries mean that event timestamps are not monotonically ordered across partitions. Use Kafka offset for ordering within a partition. For global ordering requirements, use a single partition (which caps consumer parallelism at one consumer per group and limits throughput) or implement a watermark-based ordering window in your stream processor.

Building consumers that cannot handle duplicate messages. Kafka at-least-once delivery means a consumer may receive the same message more than once — after a consumer restart, a rebalance, or a coordinator timeout. Every consumer should be idempotent: processing the same event twice should produce the same result as processing it once. The simplest pattern: include an event ID in every event, and use a Redis or database deduplication check in the consumer before processing.
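
A minimal version of that check, assuming the Jedis client and a JSON event carrying an event ID; the key prefix, TTL, and the extractEventId helper are illustrative:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Hedged idempotency sketch: SET NX claims the event ID atomically; a second
// delivery sees the existing key and is skipped. A production version also
// releases the claim if processing fails, or writes it with the state change.
private boolean firstDelivery(Jedis jedis, String eventId) {
  String result = jedis.set("processed:" + eventId, "1",
      SetParams.setParams().nx().ex(604800)); // 7-day TTL
  return "OK".equals(result); // null means the key already existed
}

@KafkaListener(topics = "payment.processed")
public void handle(ConsumerRecord<String, String> record) {
  String eventId = extractEventId(record.value()); // hypothetical JSON parse
  if (!firstDelivery(jedis, eventId)) return; // duplicate delivery, skip
  processPayment(record.value());
}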

Designing events as internal implementation details. Events published to Kafka are a public contract the moment the first consumer subscribes. Teams that design events for their own immediate needs — exposing internal field names, coupling event structure to database schema, adding undocumented implementation-specific fields — accumulate technical debt that becomes expensive to unwind as more consumers depend on the schema. Design events as public APIs from day one: clear field names, documented schema, explicit versioning strategy.

Frequently Asked Questions

What is the difference between Event Notification and Event-Carried State Transfer?

Event Notification is a minimal signal that something happened — the consumer must call back to the producer to fetch current state. Event-Carried State Transfer embeds the relevant state directly in the event so consumers can process it independently. Event-Carried State Transfer is more resilient (no upstream coupling at processing time) but requires careful schema management since the event is a contract for all consumers.

Is the Transactional Outbox pattern expensive to implement?

The core database change — adding an outbox table and writing to it in the same transaction — is straightforward in any relational database. The relay process adds complexity: a polling relay is simple to build but adds latency; a CDC relay (Debezium) delivers near-zero latency but requires access to the database transaction log (binlog/WAL) and Kafka Connect infrastructure. For new implementations on Confluent, use Debezium as a Kafka Connect source connector. The setup cost is amortized quickly for any service where reliable event publishing matters.

Does CQRS require event sourcing?

No. CQRS and event sourcing are independent patterns that are often used together but do not require each other. CQRS separates read and write models. Event sourcing stores state as a sequence of events rather than the current state. You can implement CQRS with a traditional database on the write side and Kafka-driven materialized views on the read side without ever using event sourcing. Event sourcing adds significant complexity and is appropriate for specific audit-heavy domains, not general application architecture.

How do I choose between these five patterns for a new microservice?

Start with the questions: Does the consumer need state at event time (use Event-Carried State Transfer) or current state (use Event Notification)? Does the producer need atomic database write + event publish (use Transactional Outbox)? Does the consumer need to handle failures without data loss (always use Dead Letter Stream)? Does the same write-side data need to power multiple different query patterns (use CQRS)? Most services in a mature platform use Transactional Outbox + Event-Carried State Transfer + Dead Letter Stream as the baseline, with CQRS added when read/write scaling diverges.

How large should a Kafka event payload be?

The default Kafka maximum message size is 1MB (message.max.bytes). In practice, keep event payloads well under 100KB for healthy throughput and reasonable memory overhead in consumers. For payloads larger than 100KB — document content, image metadata, large JSON blobs — use the Claim Check pattern: store the payload in object storage (GCS, S3) and include only the storage reference in the Kafka event.
