Here’s a pro-tip (although not necessarily an ingest tip): You go to conferences to get industry gossip, and you get the best gossip at the exhibit hall. The best time is Thursday evening, when beer and cocktails are served. People relax, have fun and even competitors seem to spill the beans on not-yet-announced products and features. But for me, the really interesting information comes from our customers – What are they interested in? Where are the current pain points? Which product is the current trendy “cure for cancer”?
I learn all of this mostly from the questions people ask me. It’s very much a win-win situation – people ask me questions and I do my best to help. They get (hopefully) informed answers, and I learn what they are interested in.
So, what is everyone interested in?
- Metadata – For me, this was definitely the biggest surprise of the conference. It also shows how Strata became more “corporate” over time. People asked me how to manage data sets in ways that allow users to discover relevant data. They want to know how to track who created a data set, when and how. I wish I had an open source answer, but the only solution I’m familiar with is Cloudera Navigator.
- Kafka Security – I lost track of the number of times people asked me about it. It’s on the way. Kafka itself is still trendy, I think this goes without saying. Jay Kreps couldn’t walk 2 feet without getting mobbed. Also, lots of discussion on integration with Kafka. Integration is clearly trivial if you write Java code and clearly painful if you use PHP. If you don’t program at all, Flafka (Flume plus Kafka) is still the best solution out there.
- SparkStreaming vs. Storm – I think a lot of people realized that neither of these cures cancer. Each solution has its own trade-offs and pain points. Very few people wanted to try out Samza though. I think there’s a specific set of features that people want, and it’s mostly a matter of who gets there first:
  - Kafka integration is pretty much a must-have.
  - Never lose data, and keep duplicates to a bare minimum.
  - Nice GUI for seeing current lags, throughputs, flow, skew, etc.
  - Both micro-batches and low-latency single-event processing are must-haves now.
  - Ability to dump data to HDFS, HBase and Kafka (at the very least).
  - Local state may be a good idea. I think the jury is still out on that.
  - Everyone wants better APIs. No one is sure what they look like yet.
  - YARN may or may not be a good idea.
- No one was horrified when I said that Schema is mandatory and Schema on Write is actually a very good idea. Not everyone loves Avro though.
- YARN no longer cures cancer. At least, this year no one asked when we’ll release Flume on YARN.
- Spark still cures cancer. Actually, we hear more and more Spark success stories. Maybe it’s no longer a fad but an actually useful solution. Some of the Spark sessions were very practical. We are getting tons of questions on SparkSQL too. Can’t wait for Hive-on-Spark to show up.
- From the smaller vendors – Trifacta are still one of the hottest startups around. Everyone loves clean data. Snowflake won some kind of startup competition and I heard lots of people discuss the idea of data warehouses in the cloud.
- CDC (change data capture) – Tons of discussion about this very complex topic. There are three parts to it: how to get changes out of databases (GoldenGate, Attunity or Shareplex seem to do the job), where to store the data (everyone supports HDFS; GoldenGate also supports HBase and Kafka) and how to merge the data in Hadoop (Alan Gates recommends Hive’s new batch update feature).
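
To make the "merge" part of the CDC item a bit more concrete, here is a minimal Python sketch of the general idea: take a base snapshot, take a batch of change records, keep only the latest change per key, then apply it. This is not how Hive or any of the tools mentioned above implement it. The `Change` record, its op codes (`I`, `U`, `D`) and the `seq` field are hypothetical names made up for illustration, and the sketch assumes every insert or update carries the full row.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Change(NamedTuple):
    """One captured change (hypothetical layout, for illustration only)."""
    key: str               # primary key of the affected row
    op: str                # "I" insert, "U" update, "D" delete (assumed codes)
    seq: int               # increasing sequence number from the source database
    row: Optional[dict]    # full row image; None for deletes


def merge_changes(base: Dict[str, dict],
                  changes: Iterable[Change]) -> Dict[str, dict]:
    """Return a new snapshot with the latest change per key applied to base."""
    # Step 1: collapse the batch so only the newest change per key survives.
    latest: Dict[str, Change] = {}
    for change in changes:
        seen = latest.get(change.key)
        if seen is None or change.seq > seen.seq:
            latest[change.key] = change

    # Step 2: apply the surviving changes to a copy of the base snapshot.
    merged = dict(base)
    for key, change in latest.items():
        if change.op == "D":
            merged.pop(key, None)      # deleting a missing key is a no-op
        else:
            merged[key] = change.row   # insert and update both overwrite
    return merged


if __name__ == "__main__":
    base = {"1": {"name": "a"}, "2": {"name": "b"}}
    batch = [
        Change("1", "U", 10, {"name": "a2"}),
        Change("2", "D", 11, None),
        Change("3", "I", 12, {"name": "c"}),
        Change("1", "U", 9, {"name": "stale"}),   # older than seq 10, ignored
    ]
    print(merge_changes(base, batch))
```

Collapsing to the latest change per key first means out-of-order or repeated records in a batch don't matter, which is also why this only works when each change carries the whole row rather than a partial diff.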
If you attended my presentations and want the slides, you can find them here:
- Data Architectures for Robust Decision Making (Should have been called Agile Data Pipelines): http://www.slideshare.net/gwenshap/data-architectures-for-robust-decision-making
- Hadoop Application Architectures – Clickstream Use-Case Tutorial: http://www.slideshare.net/hadooparchbook/architectural-considerations-for-hadoop-applications
Hope you got to attend Strata and learned something new. Feel free to share your most exciting finds from Strata in the comments below.