Kite’s CSV format support is one of its most popular features. It provides a quick way to get CSV data into a recommended format (Avro or Parquet), without writing an Avro schema by hand or dealing directly with file layout.
In the recent 0.18.0 release, Kite adds the same level of support for JSON. Kite can now read both line-separated and concatenated JSON records (for details, see this article on streaming JSON). In this post, I’ll walk through creating a sample dataset from JSON records.
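To make the distinction concrete, here is a short sketch (not Kite’s actual parser) showing how a reader can accept both layouts: line-separated JSON puts one record per line, while concatenated JSON runs records together with no separator at all.

```python
import json

def read_json_records(text):
    """Yield records from concatenated or line-separated JSON text.

    Uses raw_decode to parse one value at a time, skipping any
    whitespace (including newlines) between records.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace between records (the newlines of line-separated JSON)
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        record, pos = decoder.raw_decode(text, pos)
        yield record

# Both layouts parse to the same records
line_separated = '{"zip": "01001"}\n{"zip": "01002"}\n'
concatenated = '{"zip": "01001"}{"zip": "01002"}'
assert list(read_json_records(line_separated)) == \
       list(read_json_records(concatenated))
```

Either way, the result is the same stream of records, which is why Kite can treat the two layouts interchangeably.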
Importing JSON records
To create a dataset, you first need a schema that describes the zip code records. Use the new json-schema command to build a schema from the first few records in the zips.json data file:
$ kite-dataset json-schema zips.json --class ZipCode -o zips.avsc
As with the csv-schema command, a class name for the Avro record is required, and the rest of the schema is built by inspecting the data. Kite creates a schema for each of the first 20 records, then merges them to produce an overall result. (For more, see the json-schema docs.)
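The sample-and-merge idea can be illustrated with a simplified sketch. This is not Kite’s implementation (which produces a real Avro schema and handles nested records and type unions); it only shows why merging matters: fields that appear in some sampled records but not others still end up in the combined result.

```python
def infer_fields(record):
    """Map each field to a Python-level type name (a stand-in for Avro types)."""
    return {name: type(value).__name__ for name, value in record.items()}

def merge_schemas(schemas):
    """Union the fields seen across sampled records, keeping the first type seen."""
    merged = {}
    for schema in schemas:
        for field, ftype in schema.items():
            merged.setdefault(field, ftype)
    return merged

# Two sampled records with slightly different fields ("loc" vs "pop")
samples = [
    {"_id": "01001", "city": "AGAWAM", "pop": 15338, "state": "MA"},
    {"_id": "01002", "city": "CUSHMAN", "loc": [-72.5, 42.4], "state": "MA"},
]
schema = merge_schemas(infer_fields(r) for r in samples)
# The merged schema contains fields from both records
```

A real merge also has to reconcile conflicting types for the same field (for example, by producing an Avro union), which is part of what the json-schema command handles for you.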
Next, use the zips.avsc schema to create a dataset where the incoming records will be stored. This creates a Hive table called zipcodes:
$ kite-dataset create zipcodes --schema zips.avsc
Finally, import the data using the new json-import command:
$ kite-dataset json-import zips.json zipcodes
Added 29467 records to "zipcodes"
Kite matches the incoming data to the target dataset’s schema. JSON fields that aren’t present in the schema are ignored. Similarly, if a record is missing data for a required field, it is rejected with an exception. This ensures that the stored data is well-formed according to the schema the next time it is processed. In practice, it is critical to make sure your schema accurately describes the data.
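A minimal sketch of that matching rule (again, not Kite’s code; the schema shape here is invented for illustration): unknown fields are dropped, and a record missing a required field raises an error instead of being written.

```python
def conform(record, schema):
    """Project a record onto the schema: drop unknown fields,
    reject records missing a required field."""
    missing = [f for f in schema["required"] if f not in record]
    if missing:
        raise ValueError("missing required fields: %s" % missing)
    return {k: v for k, v in record.items() if k in schema["fields"]}

schema = {"fields": {"zip", "city", "state"}, "required": {"zip", "state"}}

# The unknown "extra" field is silently dropped
ok = conform({"zip": "01001", "state": "MA", "extra": 1}, schema)
# A record without "state" would raise ValueError instead
```

Failing fast on a missing required field is what guarantees every record in the dataset matches the schema, rather than discovering bad records at read time.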
Once the import command succeeds, you can query the data with the Kite CLI or with Hive:
hive> select state, count(*) from zipcodes group by state;
AK	196
AL	567
AR	578
AZ	270
CA	1523
CO	416
...
If you need to work with data coming in as JSON, converting it on import is a best practice. Both of Kite’s recommended data formats, Avro and Parquet, are binary formats that support compression while remaining splittable. The result is files that Hadoop can process efficiently and that take up less space.
$ ls -l zips.json
-rw-r--r-- 1 cloudera cloudera 3194354 Feb 12 13:29 zips.json
$ hadoop fs -ls /user/hive/warehouse/zipcodes
-rw-r--r-- 1 cloudera hive 927265 2015-02-12 17:11 /.../zipcodes/c0073ff7-399f-4a9d-882d-88e4a6b95241.avro
The Avro data is less than a third of the size of the original JSON, and can be split for parallel processing.
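A quick check of that claim, using the two file sizes from the listings above:

```python
json_bytes = 3194354   # zips.json on the local filesystem
avro_bytes = 927265    # the Avro file Kite wrote to the Hive warehouse

ratio = avro_bytes / json_bytes
# About 0.29: the Avro file is under a third of the JSON's size
assert ratio < 1 / 3
```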
There’s more work to do for JSON support, so I’m looking forward to getting feedback from users trying it out. Feel free to tell us what you think on the Kite mailing list! You can learn more about Kite at kitesdk.org.