Accomplishments in Lake Tahoe 2015

Part of the ingest.tips team took a trip to Truckee in North Lake Tahoe. We rented a house and relaxed/coded the entire week. There were a lot of fun activities: hiking, biking, motorcycle riding (I ended up bringing my CB300F), relaxing in a hot tub, and more. It was a great experience… check out the photos from the engineering in the wild blog post. I thought it would be interesting to iterate over what we achieved during this brief work-ation.

1. Kite wrapper for existing data

Kite is an API for working with data in Hadoop at a high level, like a Hive table instead of a collection of files. The extension we worked on makes it easy to wrap existing folders of data and add the configuration they need to work as Kite datasets. For example, if you have a directory of Parquet files and want to make it available to Hive, you can now run the create command and point Kite at that existing directory. Kite will figure out the schema, the partitioning, and that you’re using the Parquet format, and will create the table for you. This will be available in the upcoming Kite 1.1.0 release.
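For example, something along these lines should register an existing directory of Parquet files as a Hive table once 1.1.0 is out (the dataset name and HDFS path below are placeholders, and the exact flags are worth checking against the Kite CLI reference):

    # wrap an existing directory of Parquet files as a Hive-backed Kite dataset
    kite-dataset create dataset:hive:default/events \
        --location hdfs:/data/events

Kite inspects the files under the given location, infers the Avro schema, partition strategy, and format, and creates the matching Hive table.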

2. CSV and JSON entity parsers for the Flume Dataset Sink

The DatasetSink is a Kite-based sink that works like the Flume HDFS sink, except that it handles partitioning automatically according to the Kite dataset’s configuration and also supports multiple schema versions and rolling restarts. Previously the sink only accepted Avro-encoded records; in Tahoe we added support for parsing Flume events as JSON or CSV, just like the Kite command line does.
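To give a feel for the configuration, here is a rough sketch of a Flume agent using the DatasetSink. The property names follow the DatasetSink as I understand it, and kite.entityParser in particular is an assumed name for the new parser option, so confirm it against the sink documentation before relying on it:

    # Kite DatasetSink writing Flume events into a Kite dataset
    agent.sinks.kite.type = org.apache.flume.sink.kite.DatasetSink
    agent.sinks.kite.channel = memoryChannel
    agent.sinks.kite.kite.dataset.uri = dataset:hive:default/events
    # assumed property name: choose how event bodies are parsed (avro/json/csv)
    agent.sinks.kite.kite.entityParser = json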

3. Sqoop2 HBase connector

The HBase connector patch has been under construction for a little while now, the biggest blockers being the lack of testing and an unclear configuration story. Over the course of the trip, I threw together integration tests for the HBase connector in SQOOP-1883. Hopefully this will make it into the Sqoop 1.99.7 release.

4. Sqoop Kite/Hive integration

In SQOOP-1529, Sqoop is receiving a fully functional Kite connector. It lets Sqoop take advantage of the rich feature set Kite provides: HDFS, Hive, and HBase integration; Avro, CSV, and Parquet support; and support for various compression schemes. SQOOP-1998, which adds Hive support to the Kite connector, was recently committed. This should make it into the Sqoop 1.99.6 release.
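As a sketch of how that looks from the Sqoop2 shell (the link and job names here are made up, and the exact option syntax shifts a bit between 1.99.x releases):

    sqoop:000> create link -c kite-connector
    # when prompted, name the link (e.g. kite-hive) and give it a dataset URI
    # such as dataset:hive:default/events
    sqoop:000> create job -f jdbc-link -t kite-hive

The job then reads from the JDBC side and writes through the Kite connector into the Hive-backed dataset.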

5. Incremental data transfer

The Sqoop community is pushing hard to finish incremental import, one of the major Sqoop 1 features still missing from Sqoop 2. During the hack week we added incremental import support to the HDFS connector in SQOOP-1949. We also made significant progress on storing the incremental state in the Sqoop 2 repository in SQOOP-1803, which will finish the feature end to end. Sadly, we uncovered a very old, unnoticed bug in Hadoop where it does not return the correct configuration object. This was fixed on the Hadoop side in MAPREDUCE-5875 and later reverted in MAPREDUCE-6288 because it caused additional issues.

What’s next?

You tell us!

 

E: team@ingest.tips

 



'Accomplishments in Lake Tahoe 2015' has 3 comments

  1. April 17, 2015 @ 5:08 pm Buntu

    These are all awesome features.. thanks!

    Are there any examples for #1 & #2?


    • July 30, 2015 @ 7:49 pm Dana Forsberg

      We would find “1. Kite wrapper for existing data” very useful — particularly with parquet data (making a hive table definition from it automatically).
      How exactly is the “extension” for doing this used? Via CLI? Via a java program (what method call?)?
      Any pointer would be helpful — but an example with a “how to” explanation would be ideal.
      Thanks for any help. I see the previous blog poster has the same sort of question for #1.


      • July 31, 2015 @ 5:31 pm Ryan Blue

        The feature to create a dataset around existing data was rolled into the “create” CLI command: http://kitesdk.org/docs/1.1.0/cli-reference.html#create

        You can now add a --location argument to point the tool to an existing directory of data. Kite will inspect the data to create an Avro schema and use the directory structure to create a partition strategy. Then the dataset is created with those or they are used to validate the ones you provide. There are examples in the doc linked to above, but feel free to ask more questions on the mailing list: cdk-dev@cloudera.org


