Building a Kafka Data Pipeline: From 8 Hours to 5 Minutes

Context

At a fintech payment gateway platform, we operated two databases: Couchbase (NoSQL) as the primary operational store, and MySQL as a secondary relational store used for reporting and internal tooling.

Periodically, a full data synchronization was required — copying a large hierarchical data set from Couchbase into MySQL. This process included providers, their service offerings, and deeply nested subscription metadata.

Problem

The existing process was entirely manual: a DBA would run a series of scripts that read from Couchbase and wrote to MySQL. It:

Took ~8 hours to complete
Offered no parallelism — each entity was processed sequentially
Had no retry logic — a failure mid-run required starting over
Produced no observability — it was impossible to tell how far along it was
Blocked any dependent reporting processes during the run

The team needed a repeatable, reliable, and fast alternative.

Constraints

The Couchbase data was hierarchical: provider services → providers → bundles → subscription fields
The fan-out at each level was large (thousands of providers, many bundles each)
MySQL writes had to be atomic per-entity to avoid partial records
The pipeline needed to handle partial failures gracefully without losing data
We already had a Kafka consumer infrastructure in place — building a separate system would take longer

Design

Architecture Note

The pipeline uses Kafka topics as a coordination layer between processing stages. Each stage reads from one topic and produces to the next, enabling natural fan-out. A final fan-in step activates the new data only after all previous stages complete.

The architecture uses the existing Kafka consumer as the execution engine, structured as a series of dependent stages:

Stage 1: Root trigger

Read all provider services from Couchbase
Store each in MySQL as the master entity
Emit one Kafka message per provider service (fan-out)

Stage 2: Provider processing (per service)

For each provider service message, fetch its providers from Couchbase
Batch providers into groups of N
Store provider records in MySQL
Emit one Kafka message per batch (fan-out continues)

Stage 3: Bundle + subscription processing (per batch)

For each batch of providers, fetch their bundles and subscription fields from Couchbase
Store all in MySQL under the correct provider relationships

Stage 4: Activation (fan-in)

After all batches are confirmed processed, mark old data as inactive and newly inserted data as active in a single transaction

Tradeoffs

Option	Pros	Cons

Implementation Details

Used Kafka consumer groups to allow multiple workers to process batches in parallel
Each message contained enough context to be independently retried (idempotent design)
Failed messages were logged with structured metadata and queued for manual reprocessing
The activation step used a MySQL transaction to atomically swap old → new data
Structured logging at each stage produced a complete audit trail

Key design principle

Each Kafka message was designed to be self-contained and idempotent. Processing the same message twice produces the same MySQL state — critical for safe retries.

Result

~8 hrs→~5 min

Full data synchronization time

Processing time dropped from ~8 hours to ~5 minutes — a 99% reduction
The pipeline ran reliably with automatic retry on transient failures
DBA involvement reduced to zero for routine runs
Full observability via structured logs and Kafka consumer lag monitoring

Lessons Learned

Kafka is not just a message bus — it's a surprisingly capable coordination layer for multi-stage data pipelines when you already have consumer infrastructure.
Idempotent message design is non-negotiable for any pipeline with retry logic. Design for it from the start, not as an afterthought.
The fan-in step is always the hardest part. Track completion state explicitly — don't rely on timing or message ordering.
Structured logging at each stage is worth the extra effort; it made debugging partial failures straightforward.

What I Would Improve Next

Add a dead letter queue for messages that fail after N retries, with alerting
Instrument consumer lag on each stage topic to provide real-time ETA for the full run
Consider batching the activation step incrementally (activate per-service as it completes) to reduce the "all-or-nothing" risk of the fan-in