Distributed SystemsKafkaData PipelinePythonBackend

Building a Kafka Data Pipeline: From 8 Hours to 5 Minutes

How I designed a fan-out / fan-in Kafka pipeline to replace a brittle 8-hour DBA process for migrating hierarchical data from Couchbase to MySQL.

Impact: Reduced processing time from ~8 hours to ~5 minutes

2026-04-15

Context

At a fintech payment gateway platform, we operated two databases: Couchbase (NoSQL) as the primary operational store, and MySQL as a secondary relational store used for reporting and internal tooling.

Periodically, a full data synchronization was required — copying a large hierarchical data set from Couchbase into MySQL. This process included providers, their service offerings, and deeply nested subscription metadata.

Problem

The existing process was entirely manual: a DBA would run a series of scripts that read from Couchbase and wrote to MySQL. It:

  • Took ~8 hours to complete
  • Offered no parallelism — each entity was processed sequentially
  • Had no retry logic — a failure mid-run required starting over
  • Produced no observability — it was impossible to tell how far along it was
  • Blocked any dependent reporting processes during the run

The team needed a repeatable, reliable, and fast alternative.

Constraints

  • The Couchbase data was hierarchical: provider services → providers → bundles → subscription fields
  • The fan-out at each level was large (thousands of providers, many bundles each)
  • MySQL writes had to be atomic per-entity to avoid partial records
  • The pipeline needed to handle partial failures gracefully without losing data
  • We already had a Kafka consumer infrastructure in place — building a separate system would take longer

Design

Architecture Note

The pipeline uses Kafka topics as a coordination layer between processing stages. Each stage reads from one topic and produces to the next, enabling natural fan-out. A final fan-in step activates the new data only after all previous stages complete.

The architecture uses the existing Kafka consumer as the execution engine, structured as a series of dependent stages:

Stage 1: Root trigger

  • Read all provider services from Couchbase
  • Store each in MySQL as the master entity
  • Emit one Kafka message per provider service (fan-out)

Stage 2: Provider processing (per service)

  • For each provider service message, fetch its providers from Couchbase
  • Batch providers into groups of N
  • Store provider records in MySQL
  • Emit one Kafka message per batch (fan-out continues)

Stage 3: Bundle + subscription processing (per batch)

  • For each batch of providers, fetch their bundles and subscription fields from Couchbase
  • Store all in MySQL under the correct provider relationships

Stage 4: Activation (fan-in)

  • After all batches are confirmed processed, mark old data as inactive and newly inserted data as active in a single transaction

Tradeoffs

OptionProsCons

Implementation Details

  • Used Kafka consumer groups to allow multiple workers to process batches in parallel
  • Each message contained enough context to be independently retried (idempotent design)
  • Failed messages were logged with structured metadata and queued for manual reprocessing
  • The activation step used a MySQL transaction to atomically swap old → new data
  • Structured logging at each stage produced a complete audit trail

Key design principle

Each Kafka message was designed to be self-contained and idempotent. Processing the same message twice produces the same MySQL state — critical for safe retries.

Result

~8 hrs~5 min

Full data synchronization time

  • Processing time dropped from ~8 hours to ~5 minutes — a 99% reduction
  • The pipeline ran reliably with automatic retry on transient failures
  • DBA involvement reduced to zero for routine runs
  • Full observability via structured logs and Kafka consumer lag monitoring

Lessons Learned

  • Kafka is not just a message bus — it's a surprisingly capable coordination layer for multi-stage data pipelines when you already have consumer infrastructure.
  • Idempotent message design is non-negotiable for any pipeline with retry logic. Design for it from the start, not as an afterthought.
  • The fan-in step is always the hardest part. Track completion state explicitly — don't rely on timing or message ordering.
  • Structured logging at each stage is worth the extra effort; it made debugging partial failures straightforward.

What I Would Improve Next

  • Add a dead letter queue for messages that fail after N retries, with alerting
  • Instrument consumer lag on each stage topic to provide real-time ETA for the full run
  • Consider batching the activation step incrementally (activate per-service as it completes) to reduce the "all-or-nothing" risk of the fan-in