Kafka to Snowflake: Real-Time Ingestion Patterns at Scale
May 8, 2026
13 min read
The Integration Every Data Stack Eventually Needs
The combination of Apache Kafka and Snowflake is one of the most common pairings in modern enterprise data platforms. Kafka provides durable, ordered event streaming at high throughput; Snowflake provides scalable analytical compute with excellent SQL support and a cloud-native storage model. Getting data reliably from one to the other — with the right latency, cost profile, and schema strategy — is where teams regularly make expensive mistakes.
Having built and operated this integration at scale for a platform serving 10M+ daily users, I have worked through all three primary patterns in production. This article documents each one honestly: the latency it delivers, what it costs, how it handles schema evolution, and where it breaks under load.
Three Patterns, Three Latency Profiles
The Snowflake Kafka connector — maintained by Snowflake and available on Confluent Hub for both OSS Apache Kafka and Confluent Platform — supports two core data loading mechanisms: classic Snowpipe and Snowpipe Streaming. A third pattern uses Confluent Cloud's fully managed Snowflake sink connector. Each targets a different point on the latency-cost-simplicity spectrum.
| Pattern | Latency (end-to-end) | Ordering Guarantee | Schema Evolution | Cost Profile |
|---|---|---|---|---|
| Classic Snowpipe (v3 connector) | 1–5 minutes | Not guaranteed | Manual or schema detection | No connector fees; you operate Kafka Connect |
| Snowpipe Streaming (v4 connector) | 10–30 seconds | Ordered within a partition | Schema detection + evolution | No connector fees; you operate Kafka Connect |
| Confluent Cloud managed sink | 30 seconds–3 minutes | Not guaranteed across partitions | Schema Registry integration | Per-task connector fees; no Connect cluster to run |
Pattern 1: Classic Snowpipe via the Kafka Connector
The classic pattern uses the Snowflake Kafka connector (v3, available for both OSS Kafka and Confluent) with Snowpipe as the loading mechanism. The connector buffers messages from Kafka topics, writes them to an internal Snowflake stage as temporary files when a threshold is reached (by time, memory, or message count), then triggers Snowpipe to ingest the staged files into the target table.
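The flush thresholds are tunable on the connector. A representative snippet using the connector's buffer properties (the values shown are illustrative, not recommendations):

# Flush a buffer when any one of these thresholds is crossed:
# seconds between flushes, bytes buffered, and record count.
buffer.flush.time=120
buffer.size.bytes=5000000
buffer.count.records=10000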
Every Snowflake table loaded by the connector gets two VARIANT columns: RECORD_CONTENT (the Kafka message payload as JSON or Avro) and RECORD_METADATA (topic, partition, offset, key, timestamp, schema ID). The data is stored in these semi-structured columns and can be queried with Snowflake's VARIANT syntax or flattened into typed columns via views or downstream transformations.
The latency of this pattern is bounded by Snowpipe's file ingestion pipeline: typically 1–5 minutes end-to-end from produce to queryable in Snowflake. This is acceptable for analytics and reporting workloads. It is not appropriate for any use case where sub-minute data freshness matters.
Important deprecation notice: Snowflake has announced that the classic Kafka connector (v3 and earlier) is slated for formal deprecation in mid-2026, with an 18-month migration window before end-of-life. For all new implementations, Snowflake recommends the v4 connector, which uses Snowpipe Streaming by default.
Pattern 2: Snowpipe Streaming — The Low-Latency Path
Snowpipe Streaming is a fundamentally different loading mechanism introduced with the v4 Snowflake Kafka connector. Instead of buffering messages into files and triggering batch ingestion, Snowpipe Streaming uses a row-level ingestion SDK that writes data directly into Snowflake's columnar storage in a streaming fashion — no staging files, no batch pipeline.
The latency improvement is significant: end-to-end from Kafka produce to Snowflake queryable rows is typically 10–30 seconds in production configurations. For workloads that need near-real-time analytical freshness — fraud monitoring dashboards, operational reporting, near-real-time personalization — this is the right pattern.
Snowpipe Streaming with the v4 connector also enables schema detection and evolution. When the connector encounters new columns in incoming messages, it can automatically add columns to the target Snowflake table. This eliminates the brittle schema management that frequently breaks classic Snowpipe pipelines when upstream schemas change.
Configuration for a v4 connector with Snowpipe Streaming:
name=kafka-snowflake-sink
connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
tasks.max=4
topics=payments.processed,orders.created
snowflake.url.name=ACCOUNT.snowflakecomputing.com:443
snowflake.user.name=KAFKA_SERVICE_USER
snowflake.private.key=<RSA_PRIVATE_KEY>
snowflake.database.name=ANALYTICS
snowflake.schema.name=RAW
snowflake.ingestion.method=SNOWPIPE_STREAMING
snowflake.enable.schematization=TRUE
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=com.snowflake.kafka.connector.records.SnowflakeJsonConverter
errors.tolerance=all
errors.deadletterqueue.topic.name=snowflake-dlq
Key configuration decisions:
- snowflake.ingestion.method=SNOWPIPE_STREAMING — activates the row-level streaming path. Omit this and you get the classic Snowpipe batch path.
- snowflake.enable.schematization=TRUE — enables automatic column detection and schema evolution. The connector will ALTER TABLE to add new columns as they appear in messages (see the sketch after this list).
- errors.deadletterqueue.topic.name — configures a Kafka topic to receive records that fail ingestion (malformed JSON, type mismatches). Snowpipe Streaming supports a DLQ natively; the v3 connector does not.
- tasks.max — set to match the number of partitions on your Kafka topics for maximum parallelism. Each task handles one or more partitions.
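To make the schematization behavior concrete: when a message arrives with a top-level field the table has not seen, the connector evolves the table automatically. The statement below is an illustrative sketch of the effect, assuming a hypothetical PAYMENTS_PROCESSED table and a new refund_reason field; the actual column type is inferred by the connector from the incoming data.

-- Roughly the effect when refund_reason first appears in a message
-- (hypothetical table and column; type inference is the connector's job)
ALTER TABLE ANALYTICS.RAW.PAYMENTS_PROCESSED
  ADD COLUMN REFUND_REASON VARCHAR;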
Pattern 3: Confluent Cloud Managed Snowflake Sink
If you run Confluent Cloud, the fully managed Snowflake sink connector is available in the Confluent Cloud console with no Kafka Connect cluster to operate. You configure the connector through the UI or API; Confluent handles deployment, scaling, and monitoring.
This pattern is optimized for operational simplicity over minimum latency. The managed connector micro-batches data into Snowflake using Snowpipe, with typical end-to-end latency of 30 seconds to 3 minutes depending on connector configuration. It integrates natively with Confluent Schema Registry for Avro and Protobuf schemas.
The trade-off is cost: managed connectors on Confluent Cloud are billed per connector task per hour, adding $0.12–$0.40/hour per task depending on tier and connector type. At steady state with four tasks, that is $350–$1,200/month in connector cost alone. Whether this is acceptable depends on what you would spend operating a self-managed Connect cluster.
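The arithmetic behind that range, at roughly 730 hours per month: 4 tasks × $0.12/hour × 730 hours ≈ $350/month at the low end, and 4 × $0.40 × 730 ≈ $1,170/month at the high end.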
Schema Design and the VARIANT Columns
Understanding how the Kafka connector stores data in Snowflake is critical for building queryable tables. By default, every row has two VARIANT columns:
- RECORD_CONTENT — the Kafka message payload. For JSON messages, this contains the full JSON document; for Avro, the deserialized payload. Query nested fields with dot notation: record_content:user_id::string.
- RECORD_METADATA — connector metadata: topic, partition, offset, key, timestamp, schema_id, headers. Useful for deduplication and ordering logic downstream.
With schematization enabled (v4 connector + Snowpipe Streaming), the connector flattens the payload into typed columns automatically. Each top-level key in the JSON message becomes a column in the Snowflake table. Nested objects are stored as VARIANT sub-columns. This dramatically simplifies downstream SQL and eliminates the record_content:field::type casting pattern that clutters queries in the classic connector pattern.
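To make the difference concrete, here is the same query written both ways, assuming a hypothetical PAYMENTS_PROCESSED landing table whose payload carries user_id and amount fields:

-- Classic connector: fields live inside the RECORD_CONTENT variant
SELECT
    record_content:user_id::STRING        AS user_id,
    record_content:amount::NUMBER(12, 2)  AS amount
FROM analytics.raw.payments_processed;

-- v4 connector with schematization: fields are typed top-level columns
SELECT user_id, amount
FROM analytics.raw.payments_processed;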
For Avro schemas, use Confluent Schema Registry with the io.confluent.connect.avro.AvroConverter. The connector retrieves the schema by ID from the registry, deserializes the message, and maps fields to Snowflake columns. When the Avro schema evolves with a backward-compatible addition (new optional field), the connector automatically adds the new column to the Snowflake table without manual intervention.
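For reference, the converter side of that setup looks roughly like this in the connector properties (the Schema Registry URL is a placeholder):

value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=https://schema-registry.internal:8081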
Production Lessons at 10M+ User Scale
Running the Kafka-to-Snowflake pipeline for the Mr. Cooper Customer Data Platform — processing hundreds of thousands of events daily across mortgage servicing workflows — surfaced several operational realities that documentation does not cover.
Partition count determines your task parallelism ceiling. Each connector task can handle multiple partitions, but one partition can only be assigned to one task. If a topic has 12 partitions and you configure tasks.max=20, you still only get 12 tasks working in parallel. Design your topic partition counts with downstream connector parallelism in mind, not just Kafka producer throughput.
Snowpipe Streaming does not guarantee row insertion order across partitions. Within a single partition, rows are inserted in Kafka offset order. Across partitions, rows are not ordered. If your downstream Snowflake queries depend on a globally ordered event sequence, you need to handle ordering in SQL using event timestamps or Kafka offsets stored in RECORD_METADATA.
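A minimal sketch of handling both concerns in SQL, assuming a hypothetical landing table with an event_timestamp field in the payload; deduplication keys on Kafka (partition, offset), then ordering uses the event time:

-- Deduplicate on Kafka (partition, offset), then order by event time
SELECT *
FROM analytics.raw.payments_processed
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY record_metadata:partition, record_metadata:offset
    ORDER BY record_metadata:offset
) = 1
ORDER BY record_content:event_timestamp::TIMESTAMP_NTZ;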
Dead letter queues are non-negotiable in production. Without a configured DLQ, a single malformed message will fail the connector task, halting ingestion on its partitions until someone intervenes. With a DLQ configured, bad records are routed to the dead-letter topic and the connector keeps processing. Always configure a DLQ and alert whenever messages land on the DLQ topic.
Snowpipe's polling window limits ingestion status visibility. The connector polls Snowpipe's insertReport API for up to one hour. If a file's ingestion status cannot be confirmed within that window, the connector moves the file to a table stage. This is rare in normal operations but becomes a concern during Snowflake service disruptions. Monitor connector task status and Snowflake pipe status independently.
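On the monitoring side, Snowflake exposes pipe state through SQL. A check against the classic connector's internal pipes might look like this (the pipe name is illustrative; use SHOW PIPES to find the names the connector generated):

-- List the pipes the connector created in the landing schema
SHOW PIPES IN SCHEMA ANALYTICS.RAW;

-- Inspect execution state and pending file count for one pipe
SELECT SYSTEM$PIPE_STATUS('ANALYTICS.RAW.SNOWFLAKE_KAFKA_CONNECTOR_PIPE_EXAMPLE');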
Iceberg table support. The v4 Snowflake Kafka connector supports writing to Iceberg tables, with the schema stored as structured types instead of VARIANT columns. This is the preferred path for new implementations where you want open table format compatibility and Snowflake's Iceberg support.
Frequently Asked Questions
What is the difference between Snowpipe and Snowpipe Streaming?
Classic Snowpipe ingests data from staged files (CSV, JSON, Parquet, Avro). The Kafka connector buffers messages, writes them to an internal stage, then triggers Snowpipe to process the files — resulting in 1–5 minutes of latency. Snowpipe Streaming uses a row-level SDK that writes directly to Snowflake's columnar storage without staging files, delivering 10–30 second latency. Snowpipe Streaming is the recommended path for all new Kafka-to-Snowflake integrations.
Does the Kafka connector guarantee exactly-once delivery to Snowflake?
The v3 connector with classic Snowpipe provides at-least-once delivery with deduplication logic in Snowpipe that eliminates most duplicates. The v4 connector with Snowpipe Streaming provides at-least-once delivery with connector-level offset tracking. Neither provides strict exactly-once by default. Implement idempotent MERGE operations in downstream transformations for use cases where duplicate rows cannot be tolerated.
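A sketch of such a MERGE, assuming a hypothetical payment_id natural key in the payload and a curated target table:

MERGE INTO analytics.curated.payments AS t
USING (
    -- Keep one row per payment_id, preferring the latest Kafka offset
    SELECT *
    FROM analytics.raw.payments_processed
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY record_content:payment_id
        ORDER BY record_metadata:offset DESC
    ) = 1
) AS s
ON t.payment_id = s.record_content:payment_id::STRING
WHEN MATCHED THEN UPDATE SET
    amount = s.record_content:amount::NUMBER(12, 2)
WHEN NOT MATCHED THEN INSERT (payment_id, amount) VALUES (
    s.record_content:payment_id::STRING,
    s.record_content:amount::NUMBER(12, 2)
);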
Can I write to an existing Snowflake table or does the connector create new tables?
The connector can write to existing tables. If the specified table does not exist, the connector creates it. If writing to an existing table, the connector adds the RECORD_CONTENT and RECORD_METADATA columns and requires all other columns to allow NULL values. With schematization enabled, the connector automatically manages column additions for new fields.
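If you prefer to pre-create the table, a compatible definition might look like this (the extra column is illustrative; anything beyond the two connector columns must allow NULLs):

CREATE TABLE analytics.raw.payments_processed (
    record_content  VARIANT,
    record_metadata VARIANT,
    -- any additional columns must be nullable
    load_note       VARCHAR
);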
What data formats does the Snowflake Kafka connector support?
The connector supports JSON (via SnowflakeJsonConverter or standard JsonConverter), Avro (via AvroConverter with Confluent Schema Registry), and Protobuf (via protobuf converter, added in connector version 1.5.0). Binary data should be passed through Avro or Protobuf for reliable schema management.
Should I use the v3 or v4 Snowflake Kafka connector for a new implementation?
Use the v4 connector for all new implementations. Snowflake announced in 2026 that the v3 connector is planned for deprecation with a formal announcement in mid-2026 and an 18-month migration window. The v4 connector supports Snowpipe Streaming, schema detection and evolution, Iceberg tables, and dead-letter queue handling — all unavailable in v3.
References
- Snowflake Kafka Connector Overview — docs.snowflake.com
- Snowpipe Streaming Documentation — docs.snowflake.com
- Snowflake Kafka Connector v4 — docs.snowflake.com
- Snowflake Connector for Kafka on Confluent Hub — confluent.io/hub
- Migrating the Kafka Connector from v3 to v4 — docs.snowflake.com
- Confluent Cloud Snowflake Sink Connector — docs.confluent.io