It’s been more than a year and a half since a couple of my colleagues here at Cloudera shipped the first version of Sentry (now Apache Sentry (incubating)). The project filled a huge security gap by bringing truly secure and dependable fine-grained authorization to the Hadoop ecosystem, with out-of-the-box integration for Hive. Since then the project has grown significantly, adding support for Impala, Search, and the wonderful Hue App, to name a few significant additions.
In order to provide a truly secure and centralized authorization mechanism, Sentry deployments have historically been set up so that all of Hive’s data and metadata are accessible only by HiveServer2, with every other user cut out. This has been a pain point for Sqoop users, as Sqoop does not use the HiveServer2 interface. Hence users with a Sentry-secured Hive deployment were forced to split the import task into two steps: a plain HDFS import followed by manually loading the data into Hive.
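That pre-CDH 5.3 two-step workaround looked roughly like the following sketch. The connection string, host names, table name, and staging path here are all hypothetical, chosen only to illustrate the shape of the workflow:

```shell
# Step 1: plain HDFS import into a staging directory; no --hive-import,
# since only HiveServer2 may write into the Sentry-secured warehouse.
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table text \
  --target-dir /user/jarcec/staging/text

# Step 2: load the staged files into Hive through HiveServer2,
# which does have write access to the warehouse.
beeline -u "jdbc:hive2://hs2.example.com:10000/default" \
  -e "LOAD DATA INPATH '/user/jarcec/staging/text' INTO TABLE text;"
```

Note that the LOAD DATA step had to be run by a user already granted the appropriate Hive privileges, which is exactly the manual intervention the new integration removes.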
With the inclusion of HDFS ACLs and the integration of Sentry into the Hive metastore in CDH 5.1, users were able to improve this situation and get the direct Hive import working again. However, this approach required manual administrator intervention to configure HDFS ACLs according to the Sentry configuration and needed a manual refresh to keep both systems in sync.
One of the major features included in the recently released CDH 5.3 is Sentry integration with HDFS, which enables Sqoop import into Hive without any additional administrator intervention. By exposing Sentry policies (what tables a user can select from and what tables they can insert into) directly in HDFS, Sqoop reuses the same policies that have been configured via GRANT/REVOKE statements or the Hue Sentry App and imports data into Hive without any trouble.
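You can observe this synchronization directly in HDFS with the standard getfacl command. The path and ACL entry below are illustrative only, assuming a table directory under the default warehouse location and a role granted to group jarcec:

```shell
# Sentry policies surface as synthesized HDFS ACL entries on the
# warehouse paths they cover (path and output are illustrative).
hdfs dfs -getfacl /user/hive/warehouse/text
# A role granted to a group typically shows up as a group ACL entry,
# for example:
#   group:jarcec:rwx
```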
In order for Sqoop to seamlessly import into a Sentry-secured Hive instance, the Hadoop administrator needs to follow a few configuration steps to enable all the necessary features. First, your cluster needs to use the Sentry service as the backend for storing authorization metadata, rather than relying on the older policy files.
If you are already using the Sentry service and GRANT/REVOKE statements, you can jump directly to the third step below.
- Make sure that you have Sentry service running on your cluster. You should see it in the service list:
- And that Hive is configured to use this service as a backend for Sentry metadata:
- Finally enable HDFS Integration with Sentry:
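Once the integration is enabled, a couple of quick sanity checks can confirm it took effect. These commands assume a running cluster and the default warehouse location:

```shell
# HDFS ACL support must be on, or the Sentry-HDFS plugin cannot
# attach its synthesized ACLs:
hdfs getconf -confKey dfs.namenode.acls.enabled

# The warehouse directory should now carry Sentry-managed ACL entries:
hdfs dfs -getfacl /user/hive/warehouse
```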
Example Sqoop import
Let’s assume that we have a user jarcec who needs to import data into a Hive database named default. User jarcec is part of a group that is also called jarcec (in real life the group name doesn’t have to match the username, and that is fine).
With an unsecured Hive installation, the Hadoop administrator would have to jump in and grant user jarcec write privileges on the directory /user/hive/warehouse or one of its subdirectories. With the Sentry-HDFS integration, the administrator no longer needs to jump in; instead, Sqoop will reuse the same authorization policies that have been configured through Hive SQL or via the Sentry Hue Application. Let’s assume that user bc is jarcec’s manager and already has the privilege to grant privileges in the default database.
- bc starts by invoking beeline and connecting to HiveServer2:
[bc@sqoopsentry-1 ~]$ beeline
1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> !connect jdbc:hive2://sqoopsentry-1.vpc.cloudera.com:10000/default;principal=hive/sqoopsentry-1.vpc.cloudera.com@ENT.CLOUDERA.COM
- If user jarcec is not part of any role yet, we need to create a role for him:
1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> CREATE ROLE jarcec_role;
No rows affected (0.769 seconds)
- And this new role jarcec_role needs to be granted to jarcec’s group jarcec:
1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> GRANT ROLE jarcec_role to GROUP jarcec;
No rows affected (0.651 seconds)
- And finally bc can grant access to the database default (or any other) to the role jarcec_role:
1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> GRANT ALL ON DATABASE default TO ROLE jarcec_role;
No rows affected (0.16 seconds)
By executing the steps above, user jarcec has been granted the privilege to perform any action (insert or select) on all objects inside database default. That includes creating new tables, inserting data, and querying existing tables. With those privileges, user jarcec can run the Sqoop command he is used to:
[jarcec@sqoopsentry-1 ~]$ sqoop import --connect jdbc:mysql://mysql.ent.cloudera.com/sqoop --username sqoop --password sqoop --table text --hive-import
14/12/14 15:37:38 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.0
….
14/12/14 15:38:58 INFO mapreduce.ImportJobBase: Transferred 249.7567 MB in 75.8448 seconds (3.293 MB/sec)
14/12/14 15:38:58 INFO mapreduce.ImportJobBase: Retrieved 1000000 records.
14/12/14 15:38:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `text` AS t LIMIT 1
14/12/14 15:38:58 INFO hive.HiveImport: Loading uploaded data into Hive
14/12/14 15:39:09 INFO hive.HiveImport: 14/12/14 15:39:09 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
14/12/14 15:39:09 INFO hive.HiveImport:
14/12/14 15:39:09 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.26/jars/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties
14/12/14 15:39:12 INFO hive.HiveImport: OK
14/12/14 15:39:12 INFO hive.HiveImport: Time taken: 1.079 seconds
14/12/14 15:39:12 INFO hive.HiveImport: Loading data to table default.text
14/12/14 15:39:12 INFO hive.HiveImport: setfacl: Permission denied. user=jarcec is not the owner of inode=part-m-00000
14/12/14 15:39:12 INFO hive.HiveImport: setfacl: Permission denied. user=jarcec is not the owner of inode=part-m-00001
14/12/14 15:39:12 INFO hive.HiveImport: setfacl: Permission denied. user=jarcec is not the owner of inode=part-m-00002
14/12/14 15:39:13 INFO hive.HiveImport: setfacl: Permission denied. user=jarcec is not the owner of inode=part-m-00003
14/12/14 15:39:13 INFO hive.HiveImport: Table default.text stats: [numFiles=4, numRows=0, totalSize=261888896, rawDataSize=0]
14/12/14 15:39:13 INFO hive.HiveImport: OK
14/12/14 15:39:13 INFO hive.HiveImport: Time taken: 0.719 seconds
14/12/14 15:39:13 INFO hive.HiveImport: Hive import complete.
14/12/14 15:39:13 INFO hive.HiveImport: Export directory is not empty, keeping it.
And jarcec can easily confirm in beeline that the data has indeed been imported into Hive:
0: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> show tables from default;
+------------+--+
|  tab_name  |
+------------+--+
| text       |
+------------+--+
1 row selected (0.177 seconds)
0: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> select count(*) from text;
+----------+--+
|   _c0    |
+----------+--+
| 1000000  |
+----------+--+
1 row selected (72.188 seconds)
If Hive is configured to inherit permissions, you might notice that Sqoop will print out several warnings similar to this one:
14/12/14 15:39:12 INFO hive.HiveImport: setfacl: Permission denied. user=jarcec is not the owner of inode=part-m-00000
Since there is no need to inherit HDFS permissions when Sentry-HDFS integration is enabled, you can safely ignore these messages.
That’s all folks!
The latest CDH release, version 5.3.0, brings many new features. All Sqoop users should particularly check out the Sentry integration with HDFS, as it enables simple and straightforward import into Sentry-secured Hive deployments without the need to manually configure HDFS permissions. The same SQL interface that is used to grant access to databases and tables now determines who can import (or export) data into Hive!