How To: Oozie + Sqoop + Hive

Importing data directly into Hive is a great way to quickly get your data into Hadoop. Sometimes it’s useful to schedule such a job in an Oozie workflow. In this post, we’ll walk through using Sqoop, run from Oozie, to import data from MySQL into Hive.

The Basics

Using Sqoop to import data into Hive via Oozie can be accomplished in four easy steps:

  1. Copy your hive-site.xml into HDFS
  2. Copy your MySQL JDBC jar into HDFS
  3. Create an Oozie workflow that has a Sqoop action, includes the hive-site.xml in a “file” element of the action, and includes the MySQL JDBC jar in an “archive” element of the action
  4. Run your Oozie workflow (the commands for steps 1, 2, and 4 are sketched below)
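
Steps 1, 2, and 4 boil down to a few shell commands. Here is a minimal sketch; the HDFS paths match the example workflow below, while the Oozie server URL and the job.properties file name are assumptions for illustration:

# Steps 1 and 2: upload the Hive client configuration and the MySQL JDBC jar to HDFS
hdfs dfs -put hive-site.xml /tmp/hive-site.xml
hdfs dfs -put mysql-connector-java-5.1.31-bin.jar /tmp/mysql-connector-java-5.1.31-bin.jar

# Step 4: submit the workflow (Oozie URL is assumed; a job.properties sketch is shown further down)
oozie job -oozie http://oozie.example.com:11000/oozie -config job.properties -run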

Example

Here is an example workflow:

<workflow-app name="sqoop-to-hive" xmlns="uri:oozie:workflow:0.4">
    <start to="sqoop2hive"/>
    <action name="sqoop2hive">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect jdbc:mysql://mysql.example.com/sqoop --username sqoop --password sqoop --table test --hive-import --hive-table test</command>
            <archive>/tmp/mysql-connector-java-5.1.31-bin.jar#mysql-connector-java-5.1.31-bin.jar</archive>
            <file>/tmp/hive-site.xml#hive-site.xml</file>
        </sqoop>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>

And its corresponding hive-site.xml:

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://sqoop2.example.com:9083</value>
  </property>

</configuration>
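
The workflow above references ${jobTracker} and ${nameNode}, which are typically supplied through a job.properties file passed to the Oozie client. A minimal sketch, with hostnames and application path assumed for illustration:

# job.properties (hostnames and paths are assumptions; adjust for your cluster)
nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8032
oozie.wf.application.path=${nameNode}/user/${user.name}/sqoop-to-hive
# pull in the Oozie share lib so the Sqoop action can find its classes (see key point 5 below)
oozie.use.system.libpath=true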

Key points

  1. Hive needs to be configured to use a remote metastore, because the Sqoop task may run on any node in your MapReduce cluster. At a minimum, the hive-site.xml must tell Hive which metastore to write to, which is why we set hive.metastore.uris. In the above example, hive.metastore.uris is set to “thrift://sqoop2.example.com:9083”.
  2. hive-site.xml should be uploaded to HDFS and referenced in the workflow.xml. In the above example, the hive-site.xml file has been uploaded to /tmp/hive-site.xml in HDFS.
  3. The MySQL JDBC jar should be uploaded to HDFS and referenced in the workflow.xml. In the above example, the MySQL JDBC jar file has been uploaded to /tmp/mysql-connector-java-5.1.31-bin.jar.
  4. Sqoop first imports all data into HDFS, then loads that data into Hive. To do so, Sqoop generates a Hive script and calls the Hive command line directly.
  5. Make sure to include the share lib by setting oozie.use.system.libpath to “true” in the Oozie job configuration (see the submission command below).
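
If you would rather not hard-code it in job.properties, the flag can also be passed when submitting the job; the Oozie server URL below is an assumption:

# enable the Oozie share lib at submission time
oozie job -oozie http://oozie.example.com:11000/oozie -config job.properties -Doozie.use.system.libpath=true -run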

Troubleshooting

Q: File already exists exception in task logs:

ERROR org.apache.sqoop.tool.ImportTool  - Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory <path> already exists

A: A Sqoop Hive import first transfers the data to ‘/user/<username>/<table>’ in HDFS, then loads it into Hive. If a previous run failed partway through, the import directory may already exist. The solution is to remove that directory.
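
For example, using the placeholders from the answer above:

# remove the leftover staging directory from the failed run
hdfs dfs -rm -r /user/<username>/<table>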

Q: Table already exists exception in task logs:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table test already exists)

A: The --create-hive-table option is being used and the table already exists in Hive. Either drop the table or remove the --create-hive-table option from the command.
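
For example, assuming the table from the example workflow, the existing table could be dropped with:

# drop the conflicting Hive table so the import can recreate it
hive -e "DROP TABLE IF EXISTS test;"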

Q: The Hive import starts, but an error occurs before it finishes. Typically a log message like:

INFO  org.apache.sqoop.hive.HiveImport  - Loading uploaded data into Hive

appears shortly before:

Intercepting System.exit(1)

A: Usually the hive-site.xml is missing, not included in the workflow.xml, or not configured correctly. See the example and “Key points” section above.

Q: Class not found exception in the launcher job logs:

java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SqoopMain not found

A: The share lib hasn’t been included. Set oozie.use.system.libpath=true in the job configuration passed when submitting the job.

Q: Missing driver class in launcher job logs:

ERROR org.apache.sqoop.Sqoop  - Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver

A: The JDBC driver can’t be found. Including the JDBC driver jar in an “archive” element of your workflow action, as in the example above, should fix this.

Summary

We covered a simple walkthrough for Sqooping data from MySQL into Hive via Oozie, the key points to keep in mind, and some basic troubleshooting. Hopefully this helps you schedule your Sqoop-to-Hive jobs.

The code examples are available at https://github.com/ingesttips/examples.

Have any suggestions? Feel free to tell us what you think by shooting us an email at team@ingest.tips!

 

'How To: Oozie + Sqoop + Hive' has 6 comments

  1. March 26, 2015 @ 10:06 am piyush

    I am doing the steps exactly as specified but I am getting a particular error which is
    “Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]”;
    My workflow file has the following content:-

    <job-tracker>cldx-1435-1276:8032</job-tracker>
    <name-node>hdfs://cldx-1435-1276:8020</name-node>

    <configuration>
      <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
      </property>
    </configuration>

    <command>sqoop import --connect jdbc:mysql://172.17.70.73:3306/d1 --username=root --password=admin --table t1 -m 1</command>
    <archive>/user/oozie/mysql-connector-java-5.1.31.jar#mysql-connector-java-5.1.31.jar</archive>

    I am trying to run it from java client and the workflow has two actions ;
    one for creating the hive database and one as stated above;
    The hive database is getting created but the latter task is getting failed/Killed with the error I specified;
    Have looked everything that I could find and have not been able to solve it;
    Can you specify what might cause this problem;

    thanks;

    Reply

  2. June 27, 2015 @ 1:01 pm Himanshu

    Hi Abraham
    Nice article.
    What is the best practice around Oozie Exception/Error handling?
    We have Hive Actions within Oozie workflows and find that the errors are not logging with enough detail. We need more of stack trace and more context around each failure.
    Any suggestions?
    Thanx
    Himanshu

    Reply

  3. June 29, 2015 @ 5:01 pm Abraham Elmahrek

    Hey Himanshu,

    I think Hive action error logging can be increased by setting a few hive-site.xml properties:
    – hive.log4j.file
    – hive.exec.log4j.file

    Those configuration properties take files that can be used to configure logging for Hive.

    Oozie has similar facilities to increase logging output. Find your log4j.properties file under /etc/oozie/conf/.

    For Oozie action exception/error handling, you can provide an “error” path: https://oozie.apache.org/docs/4.0.0/WorkflowFunctionalSpec.html#a3.2.1.3_Actions_Have_2_Transitions_ok_and_error.

    Reply

  4. July 23, 2015 @ 1:58 pm kalyan

    Hello ABRAHAM ELMAHREK,

    Thanks for the awesome article. Can we get to know how we can export data from a Hive table with Sqoop using Oozie?
    In our project we have a Hive table and business logic as a view on top of that Hive table.
    We have not been successful Sqooping the data from this Hive view to an Oracle database.

    regards
    Kalyan

    Reply

    • August 6, 2015 @ 5:13 am Abraham Elmahrek

      Hey Kalyan,

      I recommend simply exporting the HDFS directory. Sqoop1 doesn’t support exporting from Hive directly at the moment. In Sqoop2, the Kite connector should eventually have some support for this though.

      Reply

