Parquet Support Arriving in Sqoop

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases, enterprise data warehouses, and NoSQL systems. Sqoop can process data in various file formats, including CSV files, SequenceFiles, and Avro data files. With the growing popularity of Parquet, there has been strong demand for adding Parquet support.

Now we are very pleased to introduce Parquet integration for Sqoop1. Bringing Parquet support to Sqoop1 was a co-engineering effort between Intel and Cloudera. It is worth mentioning that the Kite SDK works behind the scenes: we took advantage of Kite’s utilities to simplify Parquet file handling rather than using parquet-mr from scratch.

Importing Data

The sqoop import command is used to import a specified table from a database to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation.

Technically speaking, Sqoop imports data in two steps. First, Sqoop introspects the database to gather the necessary metadata for the data being imported. Second, Sqoop submits a map-only MapReduce job. The job transfers actual data using the metadata captured in the previous step.

Importing Data into HDFS

The general syntax is as follows:

sqoop import \
--connect <JDBC connection string> \
--username <username> --password <password> \
--table <tablename> \
--target-dir <hdfs-dir>
  • The argument --connect specifies the JDBC connect string.
  • The arguments --username and --password specify the authentication credentials.
  • The argument --table specifies the table to read. Alternatively, you can use --query to import the results of an arbitrary SQL query.
  • The argument --target-dir specifies the destination directory in HDFS.
  • Optionally, you can specify the number of map tasks by adding -m 1.

You can import data to HDFS in Parquet format by specifying the --as-parquetfile argument explicitly. With this argument, Sqoop writes data records from a table as Parquet files in binary representation. To write Parquet files, a schema is required to represent types in memory; our implementation uses Avro for this purpose. Imported Parquet files can then be used by other MapReduce jobs or tools capable of accessing HDFS files.

For example, if you want to import all the data of table users from a MySQL database to the HDFS location hdfs://jobtracker-host:jobtracker-port/import/mysql/users with only one map task, the command can be:

sqoop import \
--connect jdbc:mysql://the_mysql_server/the_mysql_database \
--username the_mysql_user --password the_mysql_password \
--table users \
--target-dir /import/mysql/users \
--as-parquetfile -m 1

When the import completes successfully, Parquet files are created at hdfs://jobtracker-host:jobtracker-port/import/mysql/users. We use UUIDs as filenames to avoid potential name conflicts. Before doing a new import, Sqoop checks whether the target directory exists and is clean. More specifically, Sqoop checks whether a hidden sub-directory named .metadata already exists; this directory contains the information required for a later export. If it is present, the import will fail.
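
To sanity-check the result, you can list the target directory and inspect the Avro-derived schema of one of the generated files. This is a minimal sketch using standard HDFS shell commands; the UUID filename below is a placeholder, and the parquet-tools jar is assumed to be available:

# List the imported data files together with the hidden .metadata sub-directory
hdfs dfs -ls /import/mysql/users

# Print the schema of one imported file (requires the parquet-tools jar)
hadoop jar parquet-tools.jar schema /import/mysql/users/<some-uuid>.parquet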

Importing Data into Hive

If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Sqoop will convert data from the native data types within the external datastore into the corresponding types within Hive.

Importing data into Hive is as simple as adding the --hive-import and --hive-table arguments to your Sqoop command line. Note that for a Hive import the --target-dir argument is not required.

For example, if you want to import all the data of table users from a MySQL database into the Hive table users with only one map task, the command can be:

sqoop import \
--connect jdbc:mysql://the_mysql_server/the_mysql_database \
--username the_mysql_user --password the_mysql_password \
--table users \
--hive-import --hive-table users \
--as-parquetfile -m 1

As with an HDFS import, before doing a new import Sqoop checks whether the target Hive table already exists. If it does, the import will fail.
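
As a quick check, you can look at the table definition that Sqoop generated and count the imported rows from the Hive CLI. This is a minimal sketch that reuses the table name from the example above:

# Show the table definition generated for the import
hive -e 'SHOW CREATE TABLE users;'

# Count the imported rows
hive -e 'SELECT COUNT(*) FROM users;'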

Importing Data in Append Mode

Sqoop is able to append data to an existing dataset by adding the --append argument.

In append mode, files are first created in a temporary location. If all data is imported successfully, they are moved to the destination. If any error occurs, the import is considered failed and the temporary files are deleted.

Here is an example of appending data to an existing Parquet dataset:

sqoop import \
--connect jdbc:mysql://the_mysql_server/the_mysql_database \
--username the_mysql_user --password the_mysql_password \
--table users \
--target-dir /import/mysql/users \
--as-parquetfile -m 1 \
--append

Note that if the target dataset does not exist, append mode performs a normal import.
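
One way to observe the effect of --append is to compare the number of data files in the target directory before and after re-running the import. This is a hypothetical check that assumes the Kite-generated data files carry a .parquet extension:

# Count the Parquet data files in the target directory
hdfs dfs -ls /import/mysql/users | grep -c 'parquet'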

Exporting Data

The sqoop export command exports a set of files from HDFS back to a relational database. Technically speaking, Sqoop exports data in two steps. First, Sqoop introspects the database for metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.

The general syntax is as follows:

sqoop export \
--connect <JDBC connection string> \
--username <username> --password <password> \
--table <tablename> \
--export-dir <hdfs-dir>
  • The argument --connect specifies the JDBC connect string.
  • The arguments --username and --password specify the authentication credentials.
  • The argument --table specifies the table to populate. Alternatively, you can use --call to specify a stored procedure to call.
  • The argument --export-dir specifies the source path in HDFS.
  • Optionally, you can specify the number of map tasks by adding -m 1.

For example, if you want to populate a table named users from a set of Parquet files under the HDFS path /import/mysql/users, the command can be:

sqoop export \
--connect jdbc:mysql://the_mysql_server/the_mysql_database \
--username the_mysql_user --password the_mysql_password \
--table users \
--export-dir /import/mysql/users

Compared with import, there is no need to specify the file format. Sqoop determines the export file format by reading the first few bytes of the files. If a file is corrupt, the export will consequently fail.
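
For reference, every Parquet file begins with the 4-byte magic number PAR1. You can peek at a file yourself with standard HDFS shell commands (the UUID filename below is a placeholder):

# Print the first four bytes of one data file; a Parquet file starts with "PAR1"
hdfs dfs -cat /import/mysql/users/<some-uuid>.parquet | head -c 4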

Another example is to populate a table named users from the Hive warehouse. The command can be:

sqoop export \
--connect jdbc:mysql://the_mysql_server/the_mysql_database \
--username the_mysql_user --password the_mysql_password \
--table users \
--export-dir /user/hive/warehouse/users

One limitation should be mentioned here: Sqoop can only export data that was originally imported using the sqoop-import --as-parquetfile command.
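
After the export completes, you can confirm that the rows arrived on the database side. This is a minimal sketch that reuses the placeholder connection details from the examples above:

# Count the exported rows in MySQL
mysql -h the_mysql_server -u the_mysql_user -p \
  -e 'SELECT COUNT(*) FROM users;' the_mysql_database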

Future Work

There are still some limitations, which will be addressed in coming releases:

  1. Parquet file contents are compressed using the Snappy codec. Ideally, the user should be able to specify other compression codecs, such as deflate (SQOOP-1391).
  2. Partitioning is currently not supported.
  3. The current implementation requires that the export directory contain .metadata information. In other words, it can only export data that was originally imported using the sqoop-import --as-parquetfile command. Here is some discussion.
  4. Parquet support is currently available only in Sqoop1. The next generation of Sqoop is expected to provide the same functionality.

'Parquet Support Arriving in Sqoop' has 6 comments

  1. December 19, 2014 @ 10:59 am Mike England

    Hi Qian,

Thanks for the good work on integrating Parquet with Sqoop! I understand this is targeted for the 1.4.6 release; do you have an idea when this may be?

    Cheers,
    Mike

    • December 24, 2014 @ 2:27 am Qian Xu

You can build your own Sqoop from trunk or try CDH 5.3, which includes a recent Sqoop 1.4.x with Parquet support enabled.

  2. June 25, 2015 @ 11:39 pm Satyajit

I was trying "Importing Data into Hive" with Apache Sqoop 1.4.6, which for some reason gives me the following error.

    15/06/25 16:24:31 WARN spi.Registration: Not loading URI patterns in org.kitesdk.data.spi.hive.Loader
    15/06/25 16:24:31 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive:/default/pages. Check that JARs for hive datasets are on the classpath.
    org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive:/default/pages. Check that JARs for hive datasets are on the classpath.
    at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:109)
    at org.kitesdk.data.Datasets.create(Datasets.java:228)
    at org.kitesdk.data.Datasets.create(Datasets.java:307)
    at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:107)
    at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:89)
    at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:108)
    at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)
    at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
    Do suggest any workaround for the above error.

    • July 1, 2015 @ 6:25 am Qian Xu

It seems the Kite SDK could not find the Hive library, so the Hive dataset scheme is not registered internally. Please try to set environment variables by `export HIVE_HOME=/path/to/hive; export HCAT_HOME=/path/to/hcat;` and then retry your command.

      • July 16, 2015 @ 7:32 pm Satyajit

        Hi Qian,

        Thank you for responding, I will try as suggested above.
        I was trying to set up Sqoop in IntelliJ, which I successfully completed. When I change the source code and run on the local file system, it works fine. But when I try to pass HDFS file paths, it says Wrong FS: hdfs Expected: file:///.

        The problem is that when I change the source code and run from Eclipse, things run fine on the local filesystem. But when I package the changes and run from the command line on HDFS, I get errors.

        I am assuming that from Eclipse the MR job doesn't run, whereas it does run from the command line, and I am facing different errors.

        Could you please help me with setting up IntelliJ for HDFS, so that MR jobs run and I get a chance to debug the issue that I am running into.

        Regards,
        Satyajit.

        • July 16, 2015 @ 10:01 pm Satyajit

          Hi,
          So I figured out a way and was able to progress by adding -fs and giving the namenode address.
          Now I end up at the error:
          Unknown dataset URI: hdfs://—–/user/. Check that JARs for HDFS datasets are on the classpath.
          And I face this error when I run from IntelliJ, but not from the command line.

          And I have HIVE_HOME and HCAT_HOME set in .bashrc as path variables.
          Note: I am trying to import data only onto HDFS and not into Hive tables.
