Building a Kafka Data Pipeline: From 8 Hours to 5 Minutes
How I designed a fan-out / fan-in Kafka pipeline to replace a brittle 8-hour DBA process for migrating hierarchical data from Couchbase to MySQL.
Impact: Reduced processing time from ~8 hours to ~5 minutes
Context
At a fintech payment gateway platform, we operated two databases: Couchbase (NoSQL) as the primary operational store, and MySQL as a secondary relational store used for reporting and internal tooling.
Periodically, a full data synchronization was required — copying a large hierarchical data set from Couchbase into MySQL. This process included providers, their service offerings, and deeply nested subscription metadata.
Problem
The existing process was entirely manual: a DBA would run a series of scripts that read from Couchbase and wrote to MySQL. It:
- Took ~8 hours to complete
- Offered no parallelism — each entity was processed sequentially
- Had no retry logic — a failure mid-run required starting over
- Produced no observability — it was impossible to tell how far along it was
- Blocked any dependent reporting processes during the run
The team needed a repeatable, reliable, and fast alternative.
Constraints
- The Couchbase data was hierarchical: provider services → providers → bundles → subscription fields
- The fan-out at each level was large (thousands of providers, many bundles each)
- MySQL writes had to be atomic per-entity to avoid partial records
- The pipeline needed to handle partial failures gracefully without losing data
- We already had a Kafka consumer infrastructure in place — building a separate system would take longer
Design
Architecture Note
The pipeline uses Kafka topics as a coordination layer between processing stages. Each stage reads from one topic and produces to the next, enabling natural fan-out. A final fan-in step activates the new data only after all previous stages complete.
The architecture uses the existing Kafka consumer as the execution engine, structured as a series of dependent stages:
Stage 1: Root trigger
- Read all provider services from Couchbase
- Store each in MySQL as the master entity
- Emit one Kafka message per provider service (fan-out)
Stage 2: Provider processing (per service)
- For each provider service message, fetch its providers from Couchbase
- Batch providers into groups of N
- Store provider records in MySQL
- Emit one Kafka message per batch (fan-out continues)
Stage 3: Bundle + subscription processing (per batch)
- For each batch of providers, fetch their bundles and subscription fields from Couchbase
- Store all in MySQL under the correct provider relationships
Stage 4: Activation (fan-in)
- After all batches are confirmed processed, mark old data as inactive and newly inserted data as active in a single transaction
Tradeoffs
| Option | Pros | Cons |
|---|
Implementation Details
- Used Kafka consumer groups to allow multiple workers to process batches in parallel
- Each message contained enough context to be independently retried (idempotent design)
- Failed messages were logged with structured metadata and queued for manual reprocessing
- The activation step used a MySQL transaction to atomically swap old → new data
- Structured logging at each stage produced a complete audit trail
Key design principle
Each Kafka message was designed to be self-contained and idempotent. Processing the same message twice produces the same MySQL state — critical for safe retries.
Result
Full data synchronization time
- Processing time dropped from ~8 hours to ~5 minutes — a 99% reduction
- The pipeline ran reliably with automatic retry on transient failures
- DBA involvement reduced to zero for routine runs
- Full observability via structured logs and Kafka consumer lag monitoring
Lessons Learned
- Kafka is not just a message bus — it's a surprisingly capable coordination layer for multi-stage data pipelines when you already have consumer infrastructure.
- Idempotent message design is non-negotiable for any pipeline with retry logic. Design for it from the start, not as an afterthought.
- The fan-in step is always the hardest part. Track completion state explicitly — don't rely on timing or message ordering.
- Structured logging at each stage is worth the extra effort; it made debugging partial failures straightforward.
What I Would Improve Next
- Add a dead letter queue for messages that fail after N retries, with alerting
- Instrument consumer lag on each stage topic to provide real-time ETA for the full run
- Consider batching the activation step incrementally (activate per-service as it completes) to reduce the "all-or-nothing" risk of the fan-in