Kite Adds JSON Support

Kite’s CSV format support is one of its most popular features. It provides a quick way to get CSV data into a recommended format (Avro or Parquet) without writing an Avro schema by hand or dealing directly with file layout.

In the recent 0.18.0 release, Kite adds the same level of support for JSON. Kite can now read both line-separated and concatenated JSON records (for details, see this article on streaming JSON). In this post, I’ll walk through creating a sample dataset from JSON records.
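To see what “line-separated” versus “concatenated” means in practice, here is a minimal Python sketch of a reader that accepts both (illustrative only; this is not how Kite itself is implemented):

```python
import json

def read_json_records(text):
    """Yield records from concatenated JSON: one value after another,
    with or without newlines in between. Line-separated JSON is just
    a special case of this."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip any whitespace (including newlines) between records.
        if text[pos].isspace():
            pos += 1
            continue
        record, pos = decoder.raw_decode(text, pos)
        yield record

# Both styles decode to the same records:
line_separated = '{"zip": "10280"}\n{"zip": "10013"}\n'
concatenated = '{"zip": "10280"}{"zip": "10013"}'
assert list(read_json_records(line_separated)) == list(read_json_records(concatenated))
```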


Importing JSON records


The data I’m using is the US zip code data from JSON Studio. Each record is a US zip code, along with its city, state, population, and geographic coordinates. Uncompressed, the data is about 3 MB.
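For reference, a record in zips.json looks roughly like the following (shown here as a Python literal; the exact field names come from the JSON Studio data set, so treat this as an approximation):

```python
# Approximate shape of one zips.json record:
record = {
    "_id": "10280",            # the zip code itself
    "city": "NEW YORK",
    "state": "NY",
    "pop": 5574,               # population
    "loc": [-74.016, 40.710],  # longitude, latitude
}
```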

To create a dataset, you first need a schema that describes the zip code records. Use the new json-schema command to build a schema from the first few records in the zips.json data file:

$ kite-dataset json-schema zips.json --class ZipCode -o zips.avsc

As with the csv-schema command, you supply a class name for the Avro record, and the rest of the schema is built by inspecting the data. Kite creates a schema for each of the first 20 records, then merges them to produce an overall result. (For more, see the json-schema docs.)
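Kite's real inference handles the full Avro type system, but the sample-and-merge idea can be sketched in a few lines of Python (the type mapping and nullability rule here are simplifying assumptions, not Kite's actual algorithm):

```python
import json

# Simplified mapping from Python types to Avro-style type names.
TYPE_NAMES = {bool: "boolean", int: "long", float: "double", str: "string"}

def merge_schemas(samples, name="ZipCode"):
    """Build one record schema from sample records: union the fields seen
    across samples, and make a field nullable if some samples lack it."""
    types, counts = {}, {}
    for rec in samples:
        for field, value in rec.items():
            types.setdefault(field, set()).add(TYPE_NAMES.get(type(value), "string"))
            counts[field] = counts.get(field, 0) + 1
    fields = []
    for field in sorted(types):
        type_list = sorted(types[field])
        if counts[field] < len(samples):      # missing in some samples -> nullable
            type_list = ["null"] + type_list
        fields.append({"name": field,
                       "type": type_list[0] if len(type_list) == 1 else type_list})
    return {"type": "record", "name": name, "fields": fields}

samples = [json.loads(line) for line in
           ['{"city": "NEW YORK", "pop": 5574}', '{"city": "BOSTON"}']]
schema = merge_schemas(samples)
# "pop" appears in only one sample, so it merges to ["null", "long"].
```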

Using the zips.avsc schema, create a dataset where the incoming records will be stored. This creates a Hive table called zipcodes:

$ kite-dataset create zipcodes --schema zips.avsc

Finally, import the data using the new json-import command:

$ kite-dataset json-import zips.json zipcodes
Added 29467 records to "zipcodes"

Kite matches the incoming data to the target dataset’s schema. If the JSON data has fields that aren’t present in the schema, they are ignored. Similarly, if a record is missing data for a required field, it is rejected with an exception. This ensures that the data is well-formed according to the schema the next time it is processed. In practice, it is critical to make sure your schema accurately describes the data.
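The matching behavior described above amounts to projecting each record onto the schema's fields. A toy Python model (with assumed field names; not Kite's implementation) looks like this:

```python
def project(record, schema_fields):
    """schema_fields maps field name -> True if the field is required.
    Fields in the record but not in the schema are dropped; a missing
    required field rejects the record; a missing optional field
    becomes null."""
    out = {}
    for field, required in schema_fields.items():
        if field in record:
            out[field] = record[field]
        elif required:
            raise ValueError("missing required field: " + field)
        else:
            out[field] = None
    return out

schema_fields = {"city": True, "state": True, "pop": False}

# The extra "county" field is ignored; optional "pop" becomes null.
row = project({"city": "NEW YORK", "state": "NY", "county": "New York"},
              schema_fields)
```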

Once the import command succeeds, you can query the data with the Kite CLI or with Hive:

hive> select state, count(*) from zipcodes group by state;
AK      196
AL      567
AR      578
AZ      270
CA      1523
CO      416
...


Why convert?


If you need to work with data coming in as JSON, converting it on import is a best practice. Both of the data formats Kite recommends, Avro and Parquet, are binary formats that support compression while remaining splittable. The result is files that Hadoop can process efficiently and that take up less space.
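Much of the size win comes from not repeating field names in every record. This toy comparison (Python struct packing standing in for Avro's actual encoding, which is more sophisticated) shows the principle, before compression even enters the picture:

```python
import json
import struct

records = [{"zip": "10280", "pop": 5574}] * 1000

# JSON text repeats every field name in every record.
as_json = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# A schema-aware binary format stores field names once (in the schema)
# and packs only the values; Avro works on this principle.
as_binary = b"".join(struct.pack("<5sI", r["zip"].encode(), r["pop"])
                     for r in records)

# The binary encoding is less than a third of the JSON size.
assert len(as_binary) < len(as_json) // 3
```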

$ ll zips.json
-rw-r--r-- 1 cloudera cloudera 3194354 Feb 12 13:29 zips.json
$ hadoop fs -ls /user/hive/warehouse/zipcodes
-rw-r--r--   1 cloudera hive     927265 2015-02-12 17:11 /.../zipcodes/c0073ff7-399f-4a9d-882d-88e4a6b95241.avro


The Avro data is less than a third of the size of the original JSON, and can be split for parallel processing.

There’s more work to do for JSON support, so I’m looking forward to getting feedback from users trying it out. Feel free to tell us what you think on the Kite mailing list, cdk-dev@cloudera.org! You can learn more about Kite at kitesdk.org.
