Part of the ingest.tips team took a trip to Truckee in North Lake Tahoe. We rented a house and relaxed/coded the entire week. There were plenty of fun activities: hiking, biking, motorcycle riding (I ended up bringing my CB300F), relaxing in a hot tub, and more. It was a great experience… check out the photos in the engineering-in-the-wild blog post. I thought it would be interesting to run through what we accomplished over this brief work-ation.
1. Kite wrapper for existing data
Kite is an API for working with data in Hadoop at a high level, like a Hive table instead of a collection of files. The extension we worked on makes it easy to wrap existing folders of data and add configuration to them so they work as Kite datasets. For example, if you have a directory of Parquet files and want to make it available to Hive, you can now run the create command and point Kite at that existing directory. Kite will figure out the schema and partitioning, detect that you're using the Parquet format, and create the table for you. This will be available in the upcoming Kite 1.1.0 release.
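As a sketch of how that might look from the command line (the exact flags for wrapping existing data may differ in the final 1.1.0 release, and the dataset name and path here are made up):

```shell
# Register an existing directory of Parquet files as a Hive-backed Kite dataset.
# Kite inspects the files to infer the schema, partition layout, and format.
kite-dataset create dataset:hive:default/events \
    --location hdfs:/data/warehouse/events
```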
2. CSV and JSON entity parsers for the Flume Dataset Sink
The DatasetSink is a Kite-based sink that works like the Flume HDFS sink, except that it handles partitioning automatically according to the Kite dataset's configuration and also supports multiple schema versions and rolling restarts. Previously, the sink only accepted Avro-encoded records; in Tahoe we added support for parsing Flume events as JSON or CSV, just like the Kite command line does.
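A Flume agent configuration for the DatasetSink might look roughly like this; the `kite.entityParser` property name is an assumption for the new CSV/JSON parsing support, and the dataset URI is a placeholder:

```properties
# Hypothetical agent config: deliver JSON-encoded Flume events into a Kite dataset
a1.sinks.k1.type = org.apache.flume.sink.kite.DatasetSink
a1.sinks.k1.channel = c1
a1.sinks.k1.kite.dataset.uri = dataset:hive:default/events
# Assumed property name: parse incoming event bodies as JSON instead of Avro
a1.sinks.k1.kite.entityParser = json
```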
3. Sqoop2 HBase connector
The HBase connector patch has been under construction for a little while now, the biggest blockers being the lack of testing and an unsettled configuration story. Over the course of the trip, I threw together integration tests for the HBase connector in SQOOP-1883. Hopefully this will make it into the Sqoop 1.99.7 release.
4. Sqoop Kite/Hive integration
In SQOOP-1529, Sqoop is receiving a fully functional Kite connector. It lets Sqoop use the rich feature set that Kite provides: HDFS, Hive, and HBase integration; Avro, CSV, and Parquet support; and support for various compression schemes. SQOOP-1998, which adds Hive support to the Kite connector, was recently committed. This should make it in for the Sqoop 1.99.6 release.
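In the Sqoop 2 shell, wiring a database source to the Kite connector would look roughly like this; the link and job names are illustrative, and the interactive configuration prompts are elided:

```shell
# Sketch of a Sqoop 2 shell session (most configuration prompts elided)
sqoop:000> create link --connector generic-jdbc-connector   # source database link
sqoop:000> create link --connector kite-connector           # Kite/Hive destination link
sqoop:000> create job --from jdbc-link --to kite-link       # move rows from JDBC into Kite
sqoop:000> start job --name orders-to-hive
```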
5. Incremental data transfer
The Sqoop community is aggressively pushing to finish incremental import, one of the major features available in Sqoop 1. During the hack week we added support for incremental import to the HDFS connector via SQOOP-1949. We also made significant progress on storing the incremental state in the Sqoop 2 repository in SQOOP-1803, which will complete the feature end to end. Sadly, we uncovered a very old, unnoticed bug in Hadoop where the correct configuration object is not returned. This was fixed on the Hadoop side in MAPREDUCE-5875 and later reverted in MAPREDUCE-6288 because it caused additional issues.
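For comparison, this is the Sqoop 1 feature being ported: an incremental append import that only transfers rows whose check-column value is above the last recorded high-water mark (the connection details and table name here are placeholders):

```shell
# Sqoop 1 incremental import: fetch only rows where id > 1000,
# then record the new high-water mark for the next run
sqoop import \
    --connect jdbc:mysql://db.example.com/shop \
    --table orders \
    --incremental append \
    --check-column id \
    --last-value 1000
```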
You tell us!