Note: This post was originally published at blogs.apache.org in a slightly different form.
Apache Sqoop is a tool for doing highly efficient data transfers between relational databases and the Apache Hadoop ecosystem. One significant benefit of Sqoop is that it’s easy to use and can work with a variety of systems inside as well as outside of that ecosystem. Thus, with one tool, you can import or export data from all databases supporting the JDBC interface with the same command-line arguments exposed by Sqoop. Furthermore, Sqoop was designed to be modular, allowing you to plug in specialized additions to optimize transfers for particular database systems.
Many users use the words “connector” and “driver” interchangeably, but these words mean completely different things in the context of Sqoop — which can lead to confusion because both connectors and drivers are needed for every Sqoop invocation. This blog post will explain the difference between them, and how Sqoop uses these concepts when transferring data between Hadoop and other systems.
- Driver: In Sqoop, the word “driver” refers to a JDBC Driver. JDBC is a standard Java API for accessing relational databases and some data warehouses. The Java language prescribes only what classes and methods this interface contains and the JDK does not have any default implementation. So, each database vendor is responsible for writing its own implementation that will communicate with the corresponding database with its native protocol. For this reason, JDBC drivers often carry restrictive licenses that prohibit redistribution. As a result, each user needs to download the drivers separately and install them into Sqoop.
- Connector: Despite existence of a SQL standard, every database vendor has its own dialect. The basics usually work the same way across databases, although some edge conditions might be implemented differently. Also, as SQL is a very general query processing language, it might not always be optimal for importing or exporting data out of the database server.Connectors allow Sqoop to transcend SQL dialects while optimizing data transfer. A “connector” is a pluggable piece that fetches metadata about transferred data (columns, associated data types, and so on) to drive data transfer in the most efficient manner. The most basic connector that ships with Sqoop is Generic JDBC Connector, and as the name suggests, it uses only the JDBC interface for accessing metadata and transferring data. As a result, this connector will work on most databases out of the box — but may not be optimal for your use case. Sqoop also ships with specialized connectors for MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server, IBM DB2, and Netezza, so there is usually no need to download extra connectors. However, there are additional connectors available that can add support for additional database systems or improve the performance of built-in connectors.
Example: Connecting to the Database
To demonstrate how drivers and connectors are used, let’s take a look at how Sqoop connects to the database.
Depending on specified command-line arguments and all available connectors, Sqoop will try to load the one offering the best performance. This process begins with Sqoop scanning all extra manually downloaded connectors to confirm if one can be used. If there are no manually installed connectors present or no installed connectors identify themselves as candidates, Sqoop will check the JDBC URL (usually starting with jdbc:) to see if you are connecting to a database for which a built-in special connector is available. (For example, for the jdbc:mysql:// URL that is used for MySQL, Sqoop will pick up the MySQL Connector, which is optimized for MySQL.) Finally, if no connector is yet confirmed for use, Sqoop will default to the Generic JDBC Connector.
If this selection mechanism does not suit your environment, you can use the argument –connection-manager with the class name of any arbitrary connector. However, keep in mind that connectors are usually designed for one specific database vendor and thus, for example, the MySQL Connector won’t work with PostgreSQL.
After selecting the connector, the next step is to choose the JDBC driver. As most connectors are specialized for a given database and most databases have only one JDBC driver available, the connector itself determines which driver should be used. (For example, the MySQL connector will always use the MySQL JDBC Driver called Connector/J.) The only exception is the Generic JDBC Connector, which isn’t tied to any database and thus can’t determine what JDBC Driver should be used. In that case, you have to supply the driver name in the –driver parameter on the command line. But be careful: Using the –driver parameter will always force Sqoop to use the Generic JDBC Connector regardless of if a more specialized connector is available. For example, if the MySQL specialized connector would be used because the URL starts with jdbc:mysql://, specifying the –driver option will force Sqoop to use the generic connector instead. As a result, in most cases, you should not need to use the –driver option at all!
Finally, after both connector and driver have been determined, the connection between the Sqoop client and the target database can be established, as shown in the diagram below. (Please note that the picture is only describing the process of opening the initial connection and roles of the connector and driver. It’s not describing all connections that Sqoop can open as, for example, the data transfer itself is done inside a MapReduce job and is not executed on the Sqoop client side.)
Now that you know all the pluggable pieces that Sqoop uses to connect to your database server, let’s look at where each piece lives.
The connector is a Sqoop-specific plugin, so it’s not surprising that it’s part of the Sqoop installation. In fact, Sqoop ships with a variety of connectors out of the box and various additional extra connectors can be downloaded and installed into Sqoop easily.
Even though drivers are database-specific pieces, created and distributed by the various database vendors, they are often not installed along with the database server. Instead, they need to be installed on the machine where Sqoop is executed because they are used by connectors to establish the connection with the database server. (For example, as previously explained, JDBC drivers are not shipped with Sqoop due to incompatible licenses and thus you must download and install one manually.) Appropriate drivers can usually be found on the database vendor’s website or are distributed separately with the database installation itself.
Connectors and drivers need to be installed only on the machine where Sqoop is executed; you do not need to install them on all nodes in your Hadoop cluster. Sqoop will propagate them itself whenever necessary.
I hope that this blog post has clarified the differences between connectors and drivers and why both are needed to transfer data between relational databases and Hadoop ecosystem. Happy Sqooping!