Playing with Kite in Sqoop2

Kite is a high-level data layer for Hadoop. Kite’s API is built around datasets: a dataset is a consistent interface for working with your data. Datasets are uniquely identified by URIs, e.g. dataset:hive:hive_db/hive_table. You keep control of implementation details, such as whether to use the Avro or Parquet format, HDFS or HBase storage, and Snappy or another compression codec. You only have to tell Kite what to do; Kite handles the implementation for you. For more information on Kite URIs, check out the Kite documentation.
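
For orientation, here are a few dataset URI shapes that appear throughout this post; the paths, database and table names, and the namenode address are just placeholders:

    dataset:hdfs:/path/to/data                  (HDFS dataset, no explicit authority)
    dataset:hdfs://namenode:8020/path/to/data   (HDFS dataset with an explicit authority)
    dataset:hive:hive_db/hive_table             (Hive table hive_table in database hive_db)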

Recently, a new Kite connector shipped with Sqoop 2 (1.99.5), which enables you to transfer data to and from Hive and HDFS. Today we will demonstrate how to use the Kite connector in Sqoop 2. All commands in this post are executed from Sqoop’s command-line shell. Before getting started, please run show connector to make sure the Kite connector is loaded properly.
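
As a quick orientation, the first steps in the shell might look like the sketch below. The launcher name and the server address are assumptions for illustration (packaged installs typically provide a sqoop2-shell wrapper, while the upstream tarball uses bin/sqoop.sh client), so adjust them to your environment:

    $ sqoop2-shell
    sqoop:000> set server --host sqoop2.example.com --port 12000 --webapp sqoop
    sqoop:000> show connector

The Kite connector should appear in the resulting list; note its id, since it is needed when creating links later on.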

The Configuration

To use the Kite connector, you need to create a link for the connector and a job that uses the link.

Link Configuration

Inputs associated with the link configuration include:

Input: Authority
Type: String
Description: The authority of the Kite dataset. Optional. See note 1 below.
Example: example.com:8020 or metastore:9083

Notes

  1. The authority is an optional hostname:port fragment that follows the protocol scheme. It can be either an HDFS namenode address or a Hive metastore address. The benefit of setting it on the link is that multiple jobs can share one authority: if only the authority changes, you update the link’s authority string instead of editing the URI of every related job.
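
To make that merge concrete, assume a link whose authority is namenode:8020 and a job whose dataset URI is dataset:hdfs:/data/example1 (both values are placeholders). Internally the connector resolves them to something equivalent to:

    link authority:  namenode:8020
    job URI:         dataset:hdfs:/data/example1
    resolved URI:    dataset:hdfs://namenode:8020/data/example1

If the job URI already carries its own hostname:port, that value wins and the link authority is ignored, as the job-configuration notes below describe.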

FROM Job Configuration

Inputs associated with the Job configuration for the FROM direction include:

Input: URI
Type: String
Description: The Kite dataset URI to use. Required. See notes below.
Example: dataset:hdfs:/path/to/data or dataset:hive:some_db/hive_table

Notes

  1. The dataset URI and the authority from the link configuration will be merged to create a complete dataset URI internally. If the given URI contains a valid authority string, the global authority string from the link configuration will be ignored.
  2. Only hdfs and hive are supported currently.

TO Job Configuration

Inputs associated with the Job configuration for the TO direction include:

Input: URI
Type: String
Description: The Kite dataset URI to use. Required. See notes below.
Example: dataset:hdfs:/path/to/data or dataset:hive:some_db/hive_table

Input: File format
Type: Enum
Description: The format of the data the Kite dataset should write out. Optional.
Example: PARQUET

Notes

  1. The URI and the authority from the link configuration will be merged to create a complete dataset URI internally. If the given URI also contains a valid hostname:port segment, the authority from the link configuration will be ignored.
  2. Only hdfs and hive are supported currently.
  3. Overwrite and append modes are not currently supported.

Example

Transfer Data from MySQL to HDFS

Here are the steps; a condensed shell sketch follows the list:

  1. Create a link for the Kite connector by typing create link -c 2 (assuming the connector id of the Kite connector is 2).
    1. Type kite_link1 as the name.
    2. Type namenode:8020 as the authority value (assuming the HDFS namenode is namenode:8020).
    3. Now the link is created. The link id is assumed to be 1.
  2. Create a link for the Generic JDBC connector. The link id is assumed to be 2 and the name is mysql_link1.
  3. Create a job that reads from mysql_link1 and writes to kite_link1 by running create job -f 2 -t 1.
    1. Type example1_mysql_to_hdfs as the name.
    2. Fill in the mandatory fields required by the Generic JDBC connector.
    3. Type dataset:hdfs:/data/example1 as the Dataset URI.
    4. Choose AVRO or PARQUET as the file format. Note that the CSV format is supported only experimentally.
    5. The throttling settings are optional.
    6. Now the job is created. The job id is assumed to be 1.
  4. Type start job -j 1 -s to start the job and watch it execute.
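
Condensed into one shell sketch (the interactive prompts for the link and job inputs are omitted, and the Generic JDBC connector id of 1 is an assumption for illustration):

    sqoop:000> create link -c 2        # Kite connector (id 2) -> kite_link1, link id 1
    sqoop:000> create link -c 1        # Generic JDBC connector (id 1 assumed) -> mysql_link1, link id 2
    sqoop:000> create job -f 2 -t 1    # FROM mysql_link1 TO kite_link1 -> example1_mysql_to_hdfs, job id 1
    sqoop:000> start job -j 1 -s       # run the job and follow its progress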

During the initialization phase, Kite checks whether the target dataset already exists; the current implementation supports neither overwrite nor append. During the loading phase, Kite writes several temporary datasets, one per loader in use. The Kite connector currently creates only one partition. During the destroying phase, the Kite connector’s TO destroyer merges all the temporary datasets into a single dataset.

To transfer data back from the HDFS location to MySQL, run create job -f 1 -t 2, fill in the dataset URI, and so on; the steps are quite similar, as sketched below.
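
A minimal sketch of the reverse direction, reusing the link ids assumed above (kite_link1 is 1, mysql_link1 is 2) and omitting the interactive prompts; the job id of 2 is likewise just an assumption:

    sqoop:000> create job -f 1 -t 2    # FROM kite_link1 (Kite/HDFS) TO mysql_link1 (JDBC)
    sqoop:000> start job -j 2 -s       # assumes the new job received id 2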

Transfer Data from MySQL to Hive

To write data to the Hive table hive_table1 in database hive_db1, create another, similar job with the Dataset URI set to dataset:hive:hive_db1/hive_table1, as sketched below.
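
A minimal sketch of that variation, again with the link ids assumed above; only the relevant input is shown and its prompt wording is approximate:

    sqoop:000> create job -f 2 -t 1    # FROM mysql_link1 TO kite_link1
    ...
    Dataset URI: dataset:hive:hive_db1/hive_table1
    ...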

Future Work

There is still potential work for the Kite connector, such as incremental data import, HBase read and write support, and multiple-partitioner support. We hope you enjoy the new connector.
