What was Hot at Strata 2015?

Here’s a pro-tip (although not necessarily an ingest tip): You go to conferences to get industry gossip, and you get the best gossip at the exhibit hall. The best time is Thursday evening, when beer and cocktails are served. People relax and have fun, and even competitors seem to spill the beans on not-yet-announced products and features. But for me, the really interesting information comes from our customers: What are they interested in? Where are the current pain points? Which product is the current trendy “cure for cancer”?

I learn all of this mostly from the questions people ask me. It’s very much a win-win situation: people ask me questions and I do my best to help. They get (hopefully) informed answers, and I learn what they are interested in.

So, what is everyone interested in?

  1. Metadata - For me, this was definitely the biggest surprise of the conference. It also shows how Strata became more “corporate” over time. People asked me how to manage data sets in ways that allow users to discover relevant data. They want to know how to track who created a data set, when and how. I wish I had an open source answer, but the only solution I’m familiar with is Cloudera Navigator.
  2. Kafka Security – I lost track of the number of times people asked me about it. It’s on the way. Kafka itself is still trendy; I think this goes without saying. Jay Kreps couldn’t walk 2 feet without getting mobbed. There was also lots of discussion on integration with Kafka. Integration is clearly trivial if you write Java code and clearly painful if you use PHP, and if you don’t program at all, Flafka is still the best solution out there.
  3. SparkStreaming vs. Storm – I think a lot of people realized that neither of these cures cancer. Each solution has its own trade-offs and pain points. Very few people wanted to try out Samza though. I think there’s a specific set of features that people want, and it’s mostly a matter of who gets there first:
    1. Kafka integration is pretty much a must-have.
    2. Never lose data, and keep duplicates to a minimum.
    3. A nice GUI for seeing current lags, throughputs, flow, skew, etc.
    4. Both micro-batches and low-latency single-event processing are must-haves now.
    5. Ability to dump data to HDFS, HBase and Kafka (at the very least).
    6. Local state may be a good idea. I think the jury is still out on that.
    7. Everyone wants better APIs. No one is sure what they look like yet.
    8. YARN may or may not be a good idea.
    8. YARN may or may not be a good idea.
  4. No one was horrified when I said that Schema is mandatory and Schema on Write is actually a very good idea. Not everyone loves Avro though.
  5. YARN no longer cures cancer. At least, this year no one asked when we’ll release Flume on YARN.
  6. Spark still cures cancer. Actually, we hear more and more Spark success stories. Maybe it’s no longer a fad but an actually useful solution. Some of the Spark sessions were very practical. We are getting tons of questions on SparkSQL too. Can’t wait for Hive-on-Spark to show up.
  7. From the smaller vendors – Trifacta are still one of the hottest startups around. Everyone loves clean data. Snowflake won some kind of startup competition and I heard lots of people discuss the idea of data warehouses in the cloud.
  8. CDC - Tons of discussion about this very complex topic. There are three parts to it: how to get changes out of databases (GoldenGate, Attunity or Shareplex seem to do the job), where to store the data (everyone supports HDFS; GoldenGate supports HBase and Kafka) and how to merge the data in Hadoop (Alan Gates recommends Hive’s new batch update feature).
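
To make the "merge" part of CDC a little more concrete, here is a minimal Python sketch of the idea. It is not how GoldenGate, Attunity, Shareplex or Hive actually do it. The `Change` record, its fields and the `merge_changes` function are hypothetical names invented for this illustration, and it assumes every change carries a primary key and a sequence number that gives the order in which changes happened.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Change(NamedTuple):
    """A hypothetical CDC record (illustrative only, not a vendor format)."""
    op: str               # "insert", "update" or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # full new row image; None for deletes
    seq: int              # assumed ordering of changes (e.g. a log position)


def merge_changes(base: Dict[int, dict], changes: Iterable[Change]) -> Dict[int, dict]:
    """Apply a batch of changes to a base snapshot and return a new snapshot."""
    merged = dict(base)  # copy so the original snapshot is left untouched
    # Apply changes in sequence order so the latest change to a key wins.
    for change in sorted(changes, key=lambda c: c.seq):
        if change.op == "delete":
            merged.pop(change.key, None)
        elif change.op in ("insert", "update"):
            merged[change.key] = change.row
        else:
            raise ValueError(f"unknown operation: {change.op}")
    return merged


base = {1: {"name": "a"}, 2: {"name": "b"}}
changes = [
    Change("update", 1, {"name": "a2"}, seq=2),
    Change("delete", 2, None, seq=1),
    Change("insert", 3, {"name": "c"}, seq=3),
]
result = merge_changes(base, changes)
```

On a real cluster the base snapshot and the change batch would be files or tables rather than in-memory dictionaries, but the logic is the same: order the changes, then let the latest change per key win.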

If you attended my presentations and want the slides, you can find them here:

Hope you got to attend Strata and learned something new. Feel free to share your most exciting finds from Strata in the comments below.

 



'What was Hot at Strata 2015?' has 5 comments

  1. February 23, 2015 @ 4:28 am Robert Berger

    In Data Architectures for Robust Decision Making, on schemas you mention that Avro is great, except if you are using Go. Any suggestions on how Go can be part of the action? We’re using Go and Clojure

  2. February 23, 2015 @ 7:54 pm Dan Osipov

    Re: “YARN may or may not be a good idea.”
    I would love to know some pros and cons?
    Is it compared to using Mesos for cluster management?

    • February 23, 2015 @ 9:27 pm Gwen Shapira

      @Dan
      I mostly reflected the debate I was hearing at Strata. SparkStreaming, Storm and Samza all work with YARN, but I’ve heard a lot of users questioning the benefits of this, and also whether more efficiencies (such as automated scaling or data locality) can be achieved with YARN.
      Unfortunately, I haven’t been very involved with Mesos, so I do not know the pros/cons of Mesos vs. Yarn in detail.

      • February 23, 2015 @ 10:33 pm Dan Osipov

        Thanks for the clarification! Since Spark Streaming, Storm, Samza, Tez, etc need a cluster manager to run, they either need to invent their own, or use a “standard” one like YARN or Mesos. From my experience, the benefits of using YARN are not always clear, but I would love to be able to deploy a cluster with some resource manager and have whatever application make use of it.
