From Relational into Kafka

Years back I had a very popular tutorial on how to migrate a Data Warehouse to Hadoop. You can see the outline on Percona’s conference site. I basically spent the last 4 years of my life migrating data from relational databases to Hadoop.

Now that I’m working on Kafka, it’s time to migrate your Data Warehouse to Kafka.

Just kidding. At least sort of kidding.

I maintain my previous position that Hadoop is a great Data Warehouse at a very attractive price. The same is not true for Kafka – Kafka is not a Data Warehouse. Not even if we let Hive query Kafka directly. The data access patterns are just too different.

Kafka is, however, a fantastic pipeline and a message bus.

Why do we need to send data from relational databases to Kafka? Because there are tons of applications that can enjoy access to data in your relational databases (especially OLTP, but not only) and you definitely don’t want every app in your organization accessing your OLTP database directly. Remember, that is the database that causes you to lose money when it crashes.

The solution: get the data out of your database and into Kafka, where everyone can access it without costing the DBAs any sleep. More importantly, without having to talk to the DBAs at all.

How do we do it? First, if you are a DBA and don’t know what this Kafka thing I’m talking about is – go look at my slides on Kafka for DBAs.

Back? Let’s get some data! Here are all the ways I’m familiar with for getting data from relational DBs to Kafka:

  1. Sqoop2 can get data from any JDBC-compatible source into Kafka. The catch? It accesses the DB directly, so you still need to talk to your DBA. You can tell him that Sqoop2 throttles database access so even though Kafka and Sqoop are awesomely parallelizable – we won’t crash his DB.
  2. The awesome James Cheng maintains a wiki listing all the projects that do MySQL-to-Kafka change data capture (CDC). CDC-based ingest is very safe because the application gets data straight from the database transaction log (the binlog, in MySQL’s case) – it doesn’t require direct access to production tables, doesn’t take locks, etc.
  3. On a similar note, the awesome Martin Kleppmann wrote a project for getting data from Postgres to Kafka by way of the Postgres transaction log. You can read all about it on the Confluent blog. This project gets bonus points for getting the data in Avro format, which means you don’t lose your schema on the way to Kafka. All other projects use some kind of delimited text format, which is not as awesome. I’m trying to find time to fix this in Sqoop2.
  4. Want data from Oracle’s Redo log (or archive logs)? Get ready to pay. GoldenGate has a Flume adapter that can land data in Kafka by way of Flafka, or they provide a nice API that you can use to write directly to Kafka. GoldenGate is not cheap, so you can talk to the nice folks at DBVisit - rumors say that they are working on their own CDC solution from Oracle to Kafka.
    BTW, if you can’t figure out where to download the GoldenGate Big Data adapters, they are hiding under the super obvious category of Fusion Middleware. Don’t ask me.
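
To make the first option a bit more concrete, here is a minimal sketch of throttled, incremental extraction from a table. It is not how Sqoop2 is implemented; it only illustrates the idea of reading small batches with a pause in between, and of shipping column names along with the values so the schema is not lost. Everything specific here is an assumption for illustration: the in-memory `sqlite3` database stands in for your JDBC source, the `orders` table is hypothetical, and the `publish` callback stands in for a real Kafka producer's send call.

```python
import json
import sqlite3
import time


def poll_new_rows(conn, last_id, batch_size, publish, pause_seconds=0.0):
    """Read rows with id > last_id in small batches and pass each to publish().

    Returns the highest id that was published, so the caller can resume later.
    """
    while True:
        cursor = conn.execute(
            "SELECT id, customer, amount FROM orders "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )
        columns = [d[0] for d in cursor.description]
        rows = cursor.fetchall()
        if not rows:
            return last_id
        for row in rows:
            # Keep column names with the values instead of a bare delimited line.
            record = dict(zip(columns, row))
            publish("orders", str(record["id"]), json.dumps(record))
            last_id = record["id"]
        # Crude throttle: give the database a break between batches.
        time.sleep(pause_seconds)


# Hypothetical source table, standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [(f"customer-{i}", 10.0 * i) for i in range(1, 6)],
)

# Stand-in for a Kafka producer: collect (topic, key, value) tuples in a list.
sent = []
last_seen = poll_new_rows(conn, 0, 2, lambda t, k, v: sent.append((t, k, v)))
```

In a real pipeline you would persist `last_seen` somewhere durable and swap the lambda for your producer. Note that polling by id only picks up inserts, not updates or deletes, which is one more reason the log-based options above are attractive.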

What do we do with the data once it’s in Kafka? Here are some popular options:

  1. Feed it to a text-indexing solution such as Cloudera Search, Elasticsearch or similar
  2. Process the data a bit using a stream processing solution and load it into a NoSQL database, where it can act as a fast “materialized view”.
  3. Move to an auditing system
  4. Move to a monitoring system
  5. Display in different dashboards
  6. Warm up application caches
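
As a rough illustration of the “materialized view” option, the sketch below replays a stream of change events into a key-value store. The event shape (`op`, `key`, `row`) is a made-up format for this example, not the output of any of the CDC projects mentioned above, and a plain Python dict stands in for the NoSQL database.

```python
import json


def apply_change(view, event):
    """Apply one change event to the view: upserts overwrite, deletes remove."""
    key = event["key"]
    if event["op"] == "delete":
        view.pop(key, None)
    else:
        view[key] = event["row"]


# Hypothetical change events, in the order they would be read from a topic.
events = [
    '{"op": "upsert", "key": "1", "row": {"name": "Ann", "city": "Oslo"}}',
    '{"op": "upsert", "key": "2", "row": {"name": "Bob", "city": "Lima"}}',
    '{"op": "upsert", "key": "1", "row": {"name": "Ann", "city": "Rome"}}',
    '{"op": "delete", "key": "2", "row": null}',
]

# A dict stands in for the NoSQL store that serves the view.
view = {}
for raw in events:
    apply_change(view, json.loads(raw))
```

Because later events for the same key overwrite earlier ones, the view always reflects the latest state of each row, and it can be rebuilt from scratch by replaying the events from the beginning.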

Know other ways of getting data from databases to Kafka? What do you do with the data once it’s in Kafka? Share in the comments.



'From Relational into Kafka' have 1 comment

  1. April 27, 2015 @ 11:17 am Patrick Jaromin

    I see a lot of potential uses for transaction logs in Kafka. At Conversant we’re in the process of adapting a proprietary data abstraction/access layer into Kafka to make the changes available for both Storm and MapReduce processing in a Lambda-ish architecture. The next steps are to eliminate the adapter and get our changesets directly from the RDBMS, perhaps via transaction logs, which would make any efforts in this space very interesting to me.

