Notes from Kafka Meetup

Last night, I attended (and presented at) the Apache Kafka NYC Meetup. The meetup was organized by Joe Stein, who did a good job of quickly creating a level playing field by making sure everyone in the audience had a basic understanding of Kafka.

Jay Kreps presented the new features coming up in the next Kafka releases. There is so much development going on that even though I’m deeply involved, some of the features he mentioned were new to me. I hope he’ll share the slides online soon.

Some points worth mentioning are:

  • We are planning to remove the direct dependency between Kafka’s consumers and ZooKeeper.
    ZooKeeper doesn’t scale very well to the number of consumers we want Kafka to support, and frankly, having the clients interact with ZooKeeper directly is pretty ugly. Jay explained how LinkedIn relieved some of the performance issues by moving ZooKeeper to SSDs, but there’s a clear need for a long-term plan. The new plan is to store the offsets consumers are reading in the Kafka brokers themselves, rather than in ZooKeeper – and I learned last night that the implementation will use the same mechanism that powers log compaction.
  • Compaction – this is actually an existing feature that not many people are familiar with. Normally, Kafka stores all data for a set amount of time and deletes anything older. When compaction is enabled, Kafka makes sure it keeps the last value written for each key (in addition to all the values written recently). This is very useful for event processing, database change data capture and streaming applications. A toy illustration of this rule appears after this list.
  • New producer (in 0.8.2) and the new consumer (0.9):
    • Written in Java (unlike the rest of Kafka, which is Scala). The reason is that Scala is not binary compatible between versions, so having one Scala project (for example, Spark) depend on artifacts from another is fairly challenging to manage.
    • The new producer will remove the distinction between the current sync and async producers. It never blocks; instead, sending a message returns a “future” object, on which you can call “get” or register a callback (see the producer sketch after this list).
  • Securing Kafka – a frequently requested improvement. The community came up with a design and posted it on the wiki. We’d love to get feedback on it.
  • More cool features coming up in 0.9: 
    • Automated partition balancing
    • Throttling of partition movement
    • Scaling the number of partitions (currently limited to around 10,000)
    • Ability to apply quotas/throttling to topics and clients.
  • Longer term features we are interested in:
    • Adding support for exactly once semantics from the producer – currently, if you get a network error after sending a message, you don’t know whether the broker got the message or not. Resending the message can result in duplicates, while not resending can lead to lost events. It would be nice if the producer could “tag” messages so that resending the same message does not result in duplicates. This can help support exactly once semantics; a small sketch of the idea appears after this list.
    • Transactions – support atomic writes across topics or partitions.
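
To make the compaction rule above concrete, here is a toy Java illustration (this is not Kafka code, and the keys and values are made up): after compaction, only the last value written for each key remains.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class CompactionSketch {
        public static void main(String[] args) {
            // A log of (key, value) updates, oldest first
            String[][] log = {
                {"user-1", "address=Oak St"},
                {"user-2", "address=Main St"},
                {"user-1", "address=Elm St"}   // a newer value for user-1
            };

            // Compaction keeps the latest value per key (recent values are also retained,
            // but this sketch only models the per-key rule)
            Map<String, String> compacted = new LinkedHashMap<>();
            for (String[] record : log) {
                compacted.put(record[0], record[1]);
            }
            compacted.forEach((k, v) -> System.out.println(k + " -> " + v));
            // Prints:
            // user-1 -> address=Elm St
            // user-2 -> address=Main St
        }
    }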
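
Here is a minimal Java sketch of what sending with the new producer looks like, as I understand the API today – the broker address and topic name are placeholders, and the details may still change before the release.

    import java.util.Properties;
    import java.util.concurrent.Future;

    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class NewProducerSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");  // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("my-topic", "key", "value");  // placeholder topic

            // "Sync" style: send() itself never blocks, but get() waits for the broker's ack
            Future<RecordMetadata> future = producer.send(record);
            RecordMetadata metadata = future.get();
            System.out.println("Wrote to partition " + metadata.partition()
                    + " at offset " + metadata.offset());

            // "Async" style: register a callback instead of blocking
            producer.send(record, new Callback() {
                @Override
                public void onCompletion(RecordMetadata md, Exception e) {
                    if (e != null) {
                        e.printStackTrace();  // the send failed
                    } else {
                        System.out.println("Acked at offset " + md.offset());
                    }
                }
            });

            producer.close();
        }
    }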
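
Below is a toy sketch of the “tagging” idea behind exactly once from the producer – this shows the general concept only, not an actual design: if each message carries a producer id and a sequence number, the broker can recognize and drop a retry it has already appended.

    import java.util.HashMap;
    import java.util.Map;

    public class DedupSketch {
        // Last sequence number accepted from each producer id
        private final Map<String, Long> lastSeq = new HashMap<>();

        // Returns true if the message should be appended,
        // false if it is a retry of a message we already have
        public boolean accept(String producerId, long sequence) {
            Long last = lastSeq.get(producerId);
            if (last != null && sequence <= last) {
                return false;  // duplicate: the original send actually made it
            }
            lastSeq.put(producerId, sequence);
            return true;
        }
    }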

I presented after Jay, on the topic of integrating Kafka and Hadoop. You can view my slides here:

The last presenter was Eric Sammer, who gave a very engaging presentation on how he used Kafka in ScalingData’s event processing systems. He highlighted architecture decisions that are rarely considered in advance, such as when to separate messages into different topics (and when you shouldn’t), which file formats to use, whether to process data in streams or micro-batches, and much more.

The chats with attendees after the presentations were equally engaging – I learned about new use cases, got to discuss some outstanding development issues face to face and shared good food and drinks. Not a bad way to spend an evening in NYC.


'Notes from Kafka Meetup' has 2 comments

  1. October 17, 2014 @ 6:28 pm Buntu

    Thanks for the slides – is there a video stream from the meetup available as well?
    We are interested in the Kafka->HDFS pipeline and would like to use the Parquet format. This seems to be possible using Spark saveAsParquet(), but we’d like to hear your recommendations on schema evolution if we are not using Avro.


    • October 20, 2014 @ 4:18 am Gwen Shapira

      As far as I know, Parquet doesn’t support schema evolution yet (although it’s on the roadmap).

      I’d also check whether Flume’s Kite Sink supports Parquet (it supports Avro for sure). It may be easier to use Flume than Spark Streaming.
      One concern is that Parquet files work best when they are at least 1GB in size, and it’s unlikely that your RDDs will be large enough. You’ll need a second job to compact the files into a good size.
