Sqoop 1 or Sqoop 2?

Sqoop1 and Sqoop2 are completely different code paths and, as such, have very different feature sets [1]. So, how do we know when we’re using Sqoop1 or Sqoop2? Here’s a quick list of ways:

Different Usecases

1. Sqoop2 has a UI devoted to it, but Sqoop1 does not

The Hue project has a wonderful UI for using Sqoop2, but nothing specific for Sqoop1.

HueSqoop-2Picture 1: UI for Sqoop2 in Hue

You can run Sqoop1 jobs from Hue, but it has to be through the Job Designer app or the Oozie app.

Screen Shot 2014-10-21 at 10.26.50 AMPicture 2: UI for Job Designer which uses Sqoop1, not Sqoop2.

Screen Shot 2014-10-21 at 10.30.03 AMPicture 3: UI for Oozie which has an action for Sqoop1, but not Sqoop2.

2. Sqoop2 has an interactive command line interface, Sqoop1 does not

Sqoop2 and Sqoop1 have slightly different command line interfaces. Since Sqoop2 is service oriented and considers metadata first, there is a sequence of steps that’s necessary to start a job:

sqoop:000> create link --cid 1
Creating link for connector with id 1
Please fill following values to create new link object
Name: hdfs1
New link was successfully created with validation status FINE and persistent id 3
sqoop:000> create link --cid 2 
Creating link for connector with id 2
Please fill following values to create new link object
Name: mysql

Link configuration

JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://solaris.abe.cloudera.com/test
Username: root
Password: ****
JDBC Connection Properties: 
There are currently 0 values in the map:
entry# 
New link was successfully created with validation status FINE and persistent id 4
sqoop:000> create job --from 4 --to 3
Creating job for links with from id 4 and to id 3
Please fill following values to create new job object
Name: MySQLtoHDFS 

From database configuration

Schema name: 
Table name: test
Table SQL statement: 
Table column names: 
Partition column name: 
Nulls in partition column: 
Boundary query: 

ToJob configuration

Output format: 
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
Choose: 0
Compression format: 
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom compression format: 
Output directory: /tmp/sqoop/hdfs

Throttling resources

Extractors: 
Loaders: 
New job was successfully created with validation status FINE  and persistent id 1
sqoop:000> start job --jid 1
Submission details
Job ID: 1
Server URL: http://localhost:12000/sqoop/
Created by: abe
Creation date: 2014-10-21 10:49:24 PDT
Lastly updated by: abe
External ID: job_1412928093521_0015

http://solaris.abe.cloudera.com:8088/proxy/application_1412928093521_0015/

Source Connector schema: Schema{name=test,columns=[
        FixedPoint{name=id,nullable=true,byteSize=null,unsigned=null}]}
2014-10-21 10:49:24 PDT: BOOTING  - Progress is not available

In Sqoop1, creating a link and job is unnecessary. You just need to specify every thing in the command line:

sqoop import --connect jdbc:oracle:thin:@//localhost/xe --username sqoop --password sqoop --table sqoop --hbase-table sqoop --column-family cf --hbase-row-key id --hbase-create-table --hbase-bulkload

3. Oozie has support for Sqoop1, not Sqoop2 yet

If you want to Sqoop data with Oozie, then you’re going to need to use Sqoop1.

<workflow-app name="test" xmlns="uri:oozie:workflow:0.4">
    <start to="MySQL-to-HDFS"/>
        <action name="MySQL-to-HDFS">
            <sqoop xmlns="uri:oozie:sqoop-action:0.2">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <command>import --connect jdbc:oracle:thin:@//localhost/xe --username sqoop --password sqoop --table sqoop --hbase-table sqoop --column-family cf --hbase-row-key id --hbase-create-table --hbase-bulkload</command>
            </sqoop>
            <ok to="end"/>
            <error to="kill"/>
        </action>
        <kill name="kill">
            <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
    <end name="end"/>
</workflow-app>

4. Cloudera Manager manages Sqoop2 service and Sqoop1

Cloudera manager manages both Sqoop versions (as of CM 5.0). The default Sqoop service that’s installed is the Sqoop 2 service (even though it might be named SQOOP-1).

Screen Shot 2014-10-21 at 10.54.49 AMPicture 4: Cloudera Manager SQOOP-1 service is actually Sqoop2.

Which am I Using?

A couple of good ways to know which you version of Sqoop you are using are:

1. Check for /usr/lib/sqoop or /usr/lib/sqoop2

Sqoop2 uses the /usr/lib/sqoop2 directory, while Sqoop1 uses the /usr/lib/sqoop directory. These directories should hold the binaries and content necessary for running both versions of Sqoop.

2. Check for /usr/lib/sqoop or /usr/lib/sqoop2

Sqoop2 uses the /var/lib/sqoop2 directory, while Sqoop1 uses the /var/lib/sqoop directory. These directories should hold extra JDBC drivers necessary for connecting to different data sources (such as Teradata and Netezza).

3. Check /var/log/sqoop2

Sqoop2 stores its logs in /var/log/sqoop2. If this files in this  directory are being updated, you’re likely using Sqoop2.

When in doubt about any thing, check with the Sqoop mailing list.

Summary

It’s good to know when you’re using Sqoop1 versus Sqoop2. Since integration with the rest of the Big Data ecosystem varies (for instance, there’s a Sqoop2 UI in Hue, but not a Sqoop1 UI), it can get confusing to know which version of Sqoop you’re using. Hopfully the above list cleared things up! Now go get Sqooping!

Tweet about this on TwitterShare on FacebookShare on LinkedIn

'Sqoop 1 or Sqoop 2?' has no comments

Be the first to comment this post!

Would you like to share your thoughts?

Your email address will not be published.