Kite 0.18.0 adds custom InputFormat support

With Kite’s recent 0.18.0 release, you can now use Kite’s command-line tools to import data using custom InputFormats. This is a great way to update older data sets without writing much, if any, code.

For example, if you’ve been using Hadoop for a while, it’s common to have data stored in sequence files. Sequence files wrap plain Java objects and offer no consistent API, other than that those objects implement Hadoop’s Writable interface. Consequently, you have to provide code to work with them, which makes using that data more difficult than it could be. With other formats, like Avro, you can instead take advantage of SQL frameworks like Hive and Impala.

In this post, I’ll show how to use Kite to convert a sequence file to an Avro or Parquet dataset. The sample data is a sequence file of ZipCode objects, based on the US zip code data available from JSON Studio. The ZipCode class is a simple Writable that lives in zips.jar (source repository).
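
To give a sense of the code that sequence files require, here is a rough sketch of what a Writable like ZipCode involves (the real class is in the source repository; the field names here are inferred from the sample records shown later in this post):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Illustrative only: every Writable needs hand-written, order-sensitive
// serialization code, and every consumer of the data needs this class.
public class ZipCode implements Writable {
  private String id;     // zip code, e.g. "01001"
  private String city;
  private double[] loc;  // [longitude, latitude]
  private int pop;
  private String state;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(id);
    out.writeUTF(city);
    out.writeDouble(loc[0]);
    out.writeDouble(loc[1]);
    out.writeInt(pop);
    out.writeUTF(state);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    id = in.readUTF();
    city = in.readUTF();
    loc = new double[] { in.readDouble(), in.readDouble() };
    pop = in.readInt();
    state = in.readUTF();
  }
}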

Importing SequenceFile data with Kite

The first step is to create an Avro schema for the ZipCode class using Kite’s obj-schema command:

$ kite-dataset obj-schema org.kitesdk.examples.ZipCode --jar zips.jar -o zips.avsc
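
The resulting zips.avsc is a plain JSON file. For a class like ZipCode, the generated schema should look roughly like this (the exact field names and types depend on the class; this version is inferred from the sample records shown later):

{
  "type": "record",
  "name": "ZipCode",
  "namespace": "org.kitesdk.examples",
  "fields": [
    {"name": "_id", "type": "string"},
    {"name": "city", "type": "string"},
    {"name": "loc", "type": {"type": "array", "items": "double"}},
    {"name": "pop", "type": "int"},
    {"name": "state", "type": "string"}
  ]
}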

The schema provides a description of the data that is independent of the class it was derived from. That makes it possible to work with the data without the original Java class once the data is converted. Next, create a dataset for the data using the create command. This creates an unpartitioned dataset in Hive called zips_from_seq:

$ kite-dataset create zips_from_seq --schema zips.avsc

Now that the target dataset is ready, use the inputformat-import command to run the import. This is backed by a MapReduce job that reads from the given input path, which in this case is hdfs:/user/me/zips.sequence:

$ export HADOOP_CLASSPATH=zips.jar # see CDK-918 and note below
$ kite-dataset inputformat-import hdfs:/user/me/zips.sequence zips_from_seq \
   --format org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \
   --jar zips.jar
Added 29467 records to "zips_from_seq"

After the import completes, the data is available in Hive and to Kite’s other command-line tools.

$ kite-dataset show zips_from_seq -n 2
{"_id": "01001", "city": "AGAWAM", "loc": [-72.622739, 42.070206], "pop": 15338, "state": "MA"}
{"_id": "01002", "city": "CUSHMAN", "loc": [-72.51565, 42.377017], "pop": 36963, "state": "MA"}

You can still write MapReduce jobs that work with the original ZipCode class and use existing mappers without modification, but the ZipCode class is no longer required to read the data. The best part is that it takes zero lines of code and only a few CLI commands to make the data accessible to more tools.
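
If you do want to keep the old MapReduce code, Kite’s DatasetKeyInputFormat can feed the imported dataset straight into a job, handing each record to the mapper as the key. A rough sketch (the URI and the reflect-based mapping back onto ZipCode are assumptions on my part):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import org.kitesdk.data.mapreduce.DatasetKeyInputFormat;
import org.kitesdk.examples.ZipCode;

public class ZipJob {
  public static Job createJob(Configuration conf) throws IOException {
    Job job = Job.getInstance(conf, "zip-analysis");
    // Records arrive as the map key (the value type is Void), mapped back
    // onto the original class via Avro reflection, so an existing
    // Mapper<ZipCode, ...> implementation can be reused as-is.
    DatasetKeyInputFormat.configure(job)
        .readFrom("dataset:hive:default/zips_from_seq")
        .withType(ZipCode.class);
    // ... set mapper, reducer, and output as usual
    return job;
  }
}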

You can learn more about Kite at kitesdk.org.

Notes

  • This is also useful for adding partitioning to a dataset that has grown too large. Kite will automatically shuffle the data for the partitioning configured on the target dataset (see the sketch after these notes).
  • Because SequenceFileInputFormat reuses record objects and Crunch’s MemPipeline buffers records in memory, you should only import from files in HDFS. Otherwise, SequenceFileInputFormat will alter records that have already been buffered.
  • The need to add zips.jar to HADOOP_CLASSPATH by hand will be fixed by CDK-918.
  • On the CDH QuickStart VM, you may need to export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce/
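
On the partitioning note above: to build a partitioned copy of this data, you could describe the partitioning in a small JSON partition strategy file, pass it to the create command with --partition-by, and then run the same inputformat-import against the new dataset. A sketch that partitions by state (check the Kite reference docs for the exact partition strategy syntax):

partitions.json:

[ {"type": "identity", "source": "state", "name": "state"} ]

$ kite-dataset create zips_by_state --schema zips.avsc --partition-by partitions.json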