A Round Trip From MySQL to Hive

Today we will show you how to transfer data between MySQL and Hive using Sqoop. The instructions in this blog post were performed on a single-node Hadoop cluster with HDFS and HiveServer2 installed.

Before getting started, please make sure that you already have a Hadoop cluster. You can either set up Hadoop and Hive from scratch or try a production-ready Hadoop distribution, such as the Cloudera Distribution of Hadoop (CDH).

Prerequisites

Set Up MySQL Data

You can skip this step if you already know where to read data from.

Assume the MySQL service is installed on node1. We create a new database test_sqoop and a table employee_salary with two records for the demonstration.

mysql -uroot -p
mysql> create database test_sqoop;
Query OK, 1 row affected (0.00 sec)
mysql> use test_sqoop;
SET FOREIGN_KEY_CHECKS=0;
DROP TABLE IF EXISTS `employee_salary`;
CREATE TABLE `employee_salary` (
  `name` text,
  `id` int(8) NOT NULL AUTO_INCREMENT,
  `salary` int(8) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=latin1;
INSERT INTO `employee_salary` VALUES ('zhangsan', 1, 5000);
INSERT INTO `employee_salary` VALUES ('lisi', 2, 5500);
commit;
 
CREATE USER 'test'@'%' IDENTIFIED BY 'test';
GRANT ALL PRIVILEGES ON test_sqoop.* TO 'test'@'%';

Please make sure that the port of the MySQL service (3306 by default) is NOT blocked by the firewall.
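
To quickly confirm that the port is reachable from the machine where Sqoop will run, you can probe it (a minimal sketch, assuming nc is available on that machine):

nc -zv node1 3306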

Install Sqoop

You can skip this step if Sqoop (including the MySQL JDBC driver) has already been installed successfully.

  1. Download the Sqoop 1.x binary tarball from the official website. Extract the tarball to /usr/lib/sqoop.
  2. Download the latest MySQL JDBC driver (Connector/J). Copy mysql-connector-java-*.jar to /usr/lib/sqoop/lib.
  3. Add /usr/lib/sqoop/bin to your PATH.
  4. Run sqoop help to check whether Sqoop has been installed successfully. The whole procedure is sketched as shell commands right after this list.
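
The steps above as shell commands (a minimal sketch; the tarball and jar file names below are placeholders that depend on the Sqoop and Connector/J versions you download):

# 1. Extract the Sqoop binary tarball to /usr/lib/sqoop
tar -xzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
sudo mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha /usr/lib/sqoop

# 2. Put the MySQL JDBC driver on Sqoop's classpath
cp mysql-connector-java-5.1.35-bin.jar /usr/lib/sqoop/lib/

# 3. Add Sqoop to the search path
export PATH=$PATH:/usr/lib/sqoop/bin

# 4. Verify the installation
sqoop help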

Check JDBC Connectivity

The following command lists all databases that the MySQL user test can read. You should see two databases: information_schema and test_sqoop.

sqoop list-databases --connect jdbc:mysql://node1 --username test --password test

information_schema
test_sqoop

We can also use sqoop list-tables to list all tables in a particular database. As we have created a table employee_salary in the test_sqoop database, the following command should return that table.

sqoop list-tables --connect jdbc:mysql://node1/test_sqoop --username test --password test

employee_salary

Transfer Data From MySQL to Hive

A basic example is to transfer all data records from a MySQL table to a Hive table.

sqoop import \
      --connect jdbc:mysql://node1/test_sqoop --username test --password test \
      --table employee_salary \
      --hive-import --hive-table hive_test1 \
      -m 1

Sqoop will create the Hive table first. If the Hive table already exists, data will be appended to it; you can overwrite it instead by adding the --hive-overwrite argument. Use the Hive client or Hue to check whether all data were imported as expected. The -m 1 argument is optional. It specifies the number of mappers, which affects not only performance but also the number of files written to the table's directory in HDFS.
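
For example, to re-import the table and replace the existing Hive data instead of appending (a sketch; it is identical to the command above except for --hive-overwrite and, purely for illustration, two mappers):

sqoop import \
      --connect jdbc:mysql://node1/test_sqoop --username test --password test \
      --table employee_salary \
      --hive-import --hive-table hive_test1 --hive-overwrite \
      -m 2

With two mappers, Sqoop splits the source table on its primary key and each mapper writes its own output file, which is why the mapper count also determines the number of files landing in HDFS.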

Let’s move on. Now there is a new record inserted into the MySQL table.

mysql> insert into employee_salary values('wangwu',3,6000);
Query OK, 1 row affected (0.00 sec)
mysql> commit;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from employee_salary;
+----------+----+--------+
| name     | id | salary |
+----------+----+--------+
| lisi     |  2 |   5500 |
| zhangsan |  1 |   5000 |
| wangwu   |  3 |   6000 |
+----------+----+--------+
3 rows in set (0.00 sec)

You might want to know how to append the new record to the Hive table. Assume the table is keyed by the column id. You can add the arguments --check-column id --incremental append --last-value 2, and Sqoop will only import records whose id is greater than 2.
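
A sketch of one way to run such an incremental import: depending on the Sqoop 1 version, --incremental append may not be accepted together with --hive-import, so this example appends directly into the HDFS directory assumed to back hive_test1 and uses Hive's default field delimiter so that the new rows parse correctly in Hive.

sqoop import \
      --connect jdbc:mysql://node1/test_sqoop --username test --password test \
      --table employee_salary \
      --check-column id --incremental append --last-value 2 \
      --target-dir /user/hive/warehouse/hive_test1 \
      --fields-terminated-by '\0001' \
      -m 1

Only the row with id 3 (wangwu) is fetched, and at the end of the run Sqoop reports the value to pass as --last-value in the next incremental import.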

Transfer Data From Hive to MySQL

Transferring data in the other direction is a bit more restricted. Hive tables can sit on different storage backends, but Sqoop 1 can only read from HDFS, so you have to specify the export directory, which is normally /user/hive/warehouse/{$DATABASE}.db/{$TABLE} (tables in Hive's default database live directly under /user/hive/warehouse/{$TABLE}). Note that Sqoop 2 (beta) does not have this limitation.
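
If you are not sure where a table's data files live, you can ask Hive for the location (a sketch; test_sqoop.employee_salary refers to a Hive table whose warehouse directory matches the --export-dir used in the command below):

hive> DESCRIBE FORMATTED test_sqoop.employee_salary;

The Location field in the output is the HDFS directory to pass to Sqoop as --export-dir.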

Unlike an import from MySQL, the target database table must already exist: Sqoop will NOT create a table with the expected schema automatically. In this demonstration we export back into the existing employee_salary table; if you export to a new table, create it first.
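
A minimal sketch of pre-creating a separate export target (the name employee_salary_export is purely illustrative; its columns must match the layout of the Hive data being exported):

CREATE TABLE `employee_salary_export` (
  `name` text,
  `id` int(8) NOT NULL,
  `salary` int(8) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

With the target table in place, the following command exports the data files behind a Hive table back to MySQL.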

sqoop export \
      --connect jdbc:mysql://node1/test_sqoop --username test --password test \
      --table employee_salary \
      --export-dir /user/hive/warehouse/test_sqoop.db/employee_salary/ \
      --input-fields-terminated-by '\0001'

When the data transfer is done, you can verify whether the target table contains the expected records. If an error occurs partway through, data may be left partially written in the target table, because a Sqoop export is not a single atomic transaction.

Conclusion

This article shows how to use Sqoop to bridge relational databases and the Hadoop ecosystem. To explore more, please visit http://sqoop.apache.org.

Thanks to debugo for permitting us to use their blog post.
