Sqoop 1.99.4 Release

Introduction

Sqoop 1.99.4 is the first release of Sqoop2 in roughly a year. It has gone through a few significant changes and is starting to look more like a generic data transfer tool. New features include the extraction of HDFS integration into its own connector, the intermediate data format, and a configuration verification tool. Also, there are a few general improvements that make Sqooping a bit easier.

New features

FROM/TO

Sqoop2 improves its design by placing Hadoop on the same level as any other data source. This enables use cases that don’t necessarily involve Hadoop, such as:

  1. Transferring data between RDBMS systems
  2. Transferring data from/to systems that blur the definition of Hadoop (Accumulo for example)
  3. Writing a custom connector that can be used to transfer data from/to other data sources

At a high level, defining FROM/TO changes the responsibilities of the connector and framework. Originally, the responsibility breakdown took the following form:

              IMPORT    EXPORT
Initializer     X         X
Partitioner     X
Extractor       X
Loader                    X
Destroyer       X         X

Table 1: Connector responsibilities

              IMPORT    EXPORT
Partitioner               X
Extractor                 X
Loader          X

Table 2: Framework responsibilities

Looking closely, we can see significant overlap. For the IMPORT case, the connector implements a Partitioner and an Extractor while the framework implements a Loader; for the EXPORT case, the roles are reversed. Both sides end up implementing the same interfaces, just for opposite directions. In essence, the framework responsibilities were removed and the existing functionality was pulled out into a connector called the HDFS connector. Here’s the new division of responsibilities:

              FROM      TO
Initializer     X        X
Partitioner     X
Extractor       X
Loader                   X
Destroyer       X        X

Table 3: New connector responsibilities

I presented on this subject at the Sqoop meetup at Hadoop World 2014. Here are the video and slides.

For a bit more information, check out the Sqoop2, Activity Finally! blog post as well.

Intermediate data format

The intermediate data format (IDF) is a fundamental piece of Sqoop2 that provides a common communication language for connectors. The IDF seeks to provide a fast but powerful representation for data as it progresses through the system. To be more explicit, the goals of the intermediate data format are:

  1. Provide a common language for connectors to implement (a variant of CSV)
  2. Concisely describe the data as it moves through the system via a Schema
  3. Enable support for different data formats

Common language

All IDFs have three representations of their data: text, object array, and CSV. The common language, which every IDF must provide, is CSV. It was selected after assessing Sqoop’s most common use cases and connectors. Generally, users were transferring data from MySQL, PostgreSQL, and Oracle to HDFS or Hive. One tool these databases all have in common is a “dump” utility that spits out data in some kind of CSV format. Thus, CSV was chosen as the common representation of data as it moves through the system.
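
For illustration, a single record in the CSV representation might look like the line below (a hypothetical row; note that Sqoop2’s CSV variant encodes text values with single quotes):

42,'sample text','2014-11-26 00:00:00.000'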

Schema

Schemas provide a description of what the data being transferred looks like. Sqoop schemas have several data types available to them, including (but not limited to) the following; a sketch of assembling a schema appears after the list:

  1. Fixed Point
  2. Text
  3. Date
  4. Time
  5. Date Time
  6. Floating Point
  7. Decimal
  8. Bit
  9. Binary
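
As a quick sketch of assembling such a schema (the column classes live under org.apache.sqoop.schema.type; the exact constructor signatures here are assumptions, so verify them against the 1.99.4 javadoc):

import org.apache.sqoop.schema.Schema;
import org.apache.sqoop.schema.type.DateTime;
import org.apache.sqoop.schema.type.FixedPoint;
import org.apache.sqoop.schema.type.Text;

public class SchemaExample {
  public static void main(String[] args) {
    // Describe a three-column record: an id, a name, and a creation timestamp.
    Schema schema = new Schema("employees");
    schema.addColumn(new FixedPoint("id"));
    schema.addColumn(new Text("name"));
    schema.addColumn(new DateTime("created"));
    System.out.println(schema);
  }
}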

Other IDFs

This is a powerful component that enables developers to define other data formats that Sqoop can understand. For example, it’s possible that Avro could be used as the IDF. To do so, developers would need to extend the IntermediateDataFormat abstract class and override a few abstract methods. The IDF would have to do the following three things:

  1. Convert between Avro and CSV
  2. Create an object array representation
  3. Manage reading and writing data

It’s expected that any new data format would extend the existing CSVIntermediateDataFormat implementation to make use of its CSV parser which converts CSV to an object array.
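
As a rough sketch of how such a format might start out (the method names follow the IntermediateDataFormat contract as described above, but the exact abstract method signatures are assumptions to check against the 1.99.4 source; the Avro conversion itself is elided):

package org.apache.sqoop.contrib.idf;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.sqoop.connector.idf.CSVIntermediateDataFormat;

public class AvroIntermediateDataFormat extends CSVIntermediateDataFormat {

  @Override
  public void setTextData(String csv) {
    // 1. Convert between Avro and CSV: parse the CSV record and build an
    //    Avro record from it (conversion elided in this sketch).
    super.setTextData(csv);
  }

  @Override
  public Object[] getObjectData() {
    // 2. Object array representation: reuse the inherited CSV parser,
    //    as recommended above.
    return super.getObjectData();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    // 3. Manage writing data: serialize Avro records here instead of raw
    //    CSV text (elided; this sketch falls back to the CSV behavior).
    super.write(out);
  }

  @Override
  public void read(DataInput in) throws IOException {
    // 3 (cont.). Manage reading data back.
    super.read(in);
  }
}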

Command line tools

New command line features can be added to Sqoop2 using the new tool runner. At its core, it just loads a tool class and passes a subset of existing arguments to it. This enables cool new features like:

  • Server configuration validation
  • Repository dumping
  • Role management
  • Upgrade tool

Another interesting fact is that the exit status of the tool runner depends on the success of the tool class. This should enable power users to create bash scripts with conditional logic.
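
For example, a minimal script along these lines (using the built-in verify tool covered in the next section) could gate later steps on the tool’s exit status:

#!/bin/bash
# Branch on the exit status of the tool runner.
if sqoop2-tool verify; then
  echo "Sqoop server configuration is valid"
else
  echo "Sqoop server configuration is broken" >&2
  exit 1
fi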

Built-in tools

The only built-in tool class in Sqoop 1.99.4 is the configuration verification tool, verify. Internally, it simply starts a Sqoop server and then destroys it. This actually goes through the entire initialization phase of the Sqoop server, so connectors, repositories, etc. will be initialized as well. Here’s a quick example invocation:

sqoop2-tool verify

Custom tools

The tool runner is pluggable. You can pass any class that’s on the classpath as the first argument, and the remaining arguments will be passed to it. The class should extend org.apache.sqoop.tools.Tool and implement the method runTool(String[] arguments). Here’s a quick example of a tool that does nothing but say “hello”:

package org.apache.sqoop.contrib.tools;

import org.apache.sqoop.tools.Tool;

public class HelloTool extends Tool {
  @Override
  public boolean runTool(String[] arguments) {
    System.out.println("hello");
    // The return value determines the tool runner's exit status.
    return true;
  }
}

Then, such a tool can be used via:

sqoop2-tool org.apache.sqoop.contrib.tools.HelloTool

Improvements

Batch execution of shell commands

The Sqoop2 shell provides an interactive experience. This is nice, but it is difficult to use when automation is desirable. Now, commands can be executed in batch in the shell client. I can imagine this making setup and manual testing of Sqoop2 a bit more automated. Configuration inputs are set with arguments of the form: --<configuration name>-<config name>-<short input name>. Here are the details (a few example arguments follow the list):

  • The configuration name is one of link, from, to.
  • The config name is one of the names of the configs within that configuration. Typically linkConfig, fromJobConfig, and toJobConfig.
  • The short input name is the machine name of the input without its corresponding config name. Example: tableName.
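
Putting the pattern together, the arguments below set a link’s connection string, the source table, and the target directory (the values, and the outputDirectory input name, are hypothetical; check your connector’s config names):

--link-linkConfig-connectionString jdbc:mysql://localhost/sqoop
--from-fromJobConfig-tableName employees
--to-toJobConfig-outputDirectory /data/employees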

REST API improved

There have been significant changes to nomenclature and the REST API that should help developers and users alike avoid confusion. A few of the nomenclature changes include:

  1. Form => Config
  2. Connector is one type of Configurable
  3. Framework => Driver
  4. Driver is a second kind of Configurable

The REST API has been changed to reflect these changes in nomenclature. It has also been improved to be more RESTful and logical. For more details, check out SQOOP-1509.

Configurable FileSystem URI in HDFS Connector

After HDFS integration was pulled out into its own connector, it received a slight upgrade: the file system URI can now be provided. By exposing the file system URI in the configuration, users can choose which cluster they’d like to work with or which kind of file system they’d like to write to or read from. For example, the URI file:/// tells Sqoop2 to write to and read from the local file system.
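
Following the batch argument pattern from earlier, setting the URI on an HDFS connector link might look like the line below (assuming the link config exposes an input named uri; verify against the connector’s actual config names):

--link-linkConfig-uri hdfs://namenode.example.com:8020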

Summary

There were several new features and improvements in this release. There’s more to look forward to as well:

  1. Kerberos and impersonation support
  2. Incremental data transfer
  3. HBase and Hive support
  4. Avro and Parquet support

For more info, check out the Sqoop2 roadmap.

Download Sqoop 1.99.4!
