Sqoop2: Activity, Finally!

 

Sqoop is a data transfer tool that moves data between traditional datastores and Hadoop. Sqoop has gone through several stages and changes, the latest of which is Sqoop2. In this blog post, we’ll take a glance at Sqoop2, its current status, and where it’s heading.

High level

To kick things off, let’s take a look at the high-level design. Sqoop2 is a service that provides a separation of concerns, such as managing how to connect to a data source (connection) and how to transfer data (job). It comes with a thin client that communicates with a server that manages connections and jobs. Hopefully the following picture helps explain the architecture of Sqoop2:

[Figure: Sqoop2 architecture for data transfer]
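
Before creating any connections or jobs, the thin client needs to be pointed at a running Sqoop2 server. Here’s a minimal sketch from the shell, assuming a local server on the default port (adjust host, port, and webapp name for your environment):

sqoop:000> set server --host localhost --port 12000 --webapp sqoop
sqoop:000> show version --all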

Sqoop aims to provide the following high-level features to its users:

  • Import
  • Export
  • Incremental import
  • Parallelized data transfer with MapReduce
  • Metastore
  • Well-defined Connector SDK
  • Security

Some of these features have already been implemented (though possibly limited in scope), while others are yet to be implemented. We’ll walk through these features, with the exception of parallelization, in the following sections.

Importing and Exporting

The latest release of Sqoop2 (1.99.3) allows users to import/export from/to any JDBC-compliant database or warehouse system. It provides a Generic JDBC Connector that manages JDBC connections. The only component in the Hadoop ecosystem currently supported is HDFS, which allows for the following data storage formats:

  • Text (CSV)
  • Sequence files

What’s currently missing from this picture is:

  • HBase
  • Hive
  • Accumulo
  • Parquet
  • Avro
  • etc.

This means that Sqoop2 can definitely import and export data to Hadoop, but has a very limited scope for the time being.
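
To see exactly which connectors a given Sqoop2 server ships with, the shell can list them along with their supported options (IDs and versions will vary by installation):

sqoop:000> show connector --all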

Sqoop2 Server

Sqoop2 provides a service that stores/manages connection parameters (connection), job information (job), and job history (submission) in its repository. Here’s a quick list of different kinds of data that can be found in the Sqoop2 server.

Connections:

  • JDBC URL
  • Username
  • Password
  • etc.

Jobs:

  • Database table to extract data from
  • Columns to extract
  • Path to location in HDFS to write to
  • etc.

Submission:

  • Submission time
  • Counters
  • Status of the job (hopefully succeeded :))
  • etc.

These different objects are reusable and work together to transfer data. See the examples below for a quick description of how connections, jobs, and submissions work together.
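
All of these objects can be inspected from the Sqoop2 shell. As a quick sketch (command flags may differ slightly between 1.99.x releases):

sqoop:000> show connection --all
sqoop:000> show job --all
sqoop:000> show submission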

Connector SDK

The Connector SDK is a well defined interface that developers can use to write custom connectors. This should enable connecting to all kinds of crazy data sources like FTPs, OAuth secured data sources, etc. Check out the connector guide for more information.

Examples

Importing from MySQL

Creating a connection

sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: mysql

Connection configuration

JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://solaris.abe.cloudera.com/test
Username: root
Password: ****
JDBC Connection Properties: 
There are currently 0 values in the map:
entry# 

Security related configuration options

Max connections: 
New connection was successfully created with validation status FINE and persistent id 1

Creating a job

sqoop:000> create job --type import --xid 1
Creating job for connection with id 1
Please fill following values to create new job object
Name: mysqltohdfs

Database configuration

Schema name: 
Table name: test
Table SQL statement: 
Table column names: 
Partition column name: 
Nulls in partition column: 
Boundary query: 

Output configuration

Storage type: 
  0 : HDFS
Choose: 0
Output format: 
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
Choose: 0
Compression format: 
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom compression format: 
Output directory: /tmp/mysqltohdfs1

Throttling resources

Extractors: 
Loaders: 
New job was successfully created with validation status FINE  and persistent id 1

Starting a job

sqoop:000> start job --jid 1
Submission details
Job ID: 1
Server URL: http://localhost:12000/sqoop/
Created by: abe
Creation date: 2014-10-09 16:26:25 PDT
Lastly updated by: abe
External ID: job_local743405977_0001

http://localhost:8080/

Connector schema: Schema{name=test,columns=[
        FixedPoint{name=id,nullable=null,byteSize=null,unsigned=null}]}
2014-10-09 16:26:25 PDT: BOOTING  - Progress is not available
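
Once the job has been submitted, its progress can be polled from the same shell, and it can be stopped if needed. For example, using the job id from above:

sqoop:000> status job --jid 1
sqoop:000> stop job --jid 1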

Work in Progress

The Sqoop2 project has recently seen a significant effort from the community to bring it to feature parity with Sqoop1. That being said, it’s extremely active. Here’s a brief list of works in progress.

From/To connectors

Currently, Sqoop2 is designed to import/export to/from Hadoop. The community is seeking to change that and make Sqoop a generic data transfer tool. This will enable more generic connector development and transferring of data to databases that were not supported in Sqoop1. To be more concise, here’s a quick list of benefits:

  • Enables generic data transfer
  • Simplifies interfaces
  • Improves connector SDK

For example, instead of:

sqoop:000> create job --type import --xid 1

It will be:

sqoop:000> create job --from 1 --to 2

Where “from” and “to” are connection IDs. For more information, check out this Jira ticket.

Intermediate Data Format

The intermediate data format is an internal feature: a common language that connectors use to communicate with each other. This simplifies connector development and allows for different intermediate data formats in the future (such as Avro).
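
For intuition, the current implementation passes each record around as a CSV-style text string. A row with an integer id, a string, and a missing value might look roughly like the line below (purely illustrative, not exact Sqoop2 output):

1,'Alice',NULL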

For more information, check out this Jira ticket.

Security

Currently, Sqoop2 is without any real security, but the future looks more pleasant than the current state. According to the roadmap, Sqoop2 is slotted to have Kerberos support in 1.99.5.

Looking even further…

In Sqoop2, metadata and components were thought about ahead of time. Also, rather than having just a command line client, Sqoop is offered as a service. The benefit of offering it as a service is pretty obvious: security. Sqoop as a service enables development of features like RBAC or ACLs.

Because Sqoop2 makes a distinction between how we connect to a data source (connection) and how we want to transfer data (job), a separation of concerns is possible. For example, a group of admins could manage connections while a group of users manages or runs jobs.

Hopefully in the near future, we’ll see some of these developments as well.

Roadmap

A roadmap for Sqoop2 is available here.

Last words

There’s tons of work being done on Sqoop2. The community is working towards feature parity with Sqoop1 and has provided a roadmap that describes how it will be developed. Sqoop2 will focus more on being a generic data transfer tool and providing data transfer as a service. This enables better security and persists metadata for future use. Though I’d say Sqoop2 is not ready for production use cases, it’s definitely on its way.
