PostgreSQL repository added to Sqoop2

If you’ve been watching Sqoop2 development the past few months, you’ve probably noticed a lot has changed. In spirit of growth and change, I’ve added another (not embedded) repository: PostgreSQL.

PostgreSQL

 

This is really important for Sqoop2 for a few reasons:

  • PostgreSQL is a mature database that has HA deployments
  • Apache Derby is no longer your only option (no problems with Derby, but we need to support more databases)
  • The stage is set for more databases in the future

Design Details

Sqoop2 has a few different components that need to be implemented for database support. Here’s a diagram which briefly describes the repository design:

Sqoop2 Repository Design

The Repository Manager chooses which provider to use. Similarly, the Repository Provider manages the Repository and Repository Handler that Sqoop2 will use. It’s easy to see that the Repository Manager and Repository Provider really exist to bootstrap the Repository and Repository Handler components, which do the majority of actual work.

There’s a lot of overlap in how PostgreSQL can be accessed and how Apache Derby is accessed. They all provide the same interface: The JDBC 4 interface. There are minor differences in the SQL standard they implement, but provide a base set of features that generally overlap. To account for the differences in SQL each database implements, Sqoop2 uses the Repository Handler abstraction.

The Repository provides a high level abstraction of work being performed. They call multiple methods in the corresponding Repository Handler in order to perform higher level tasks (such as creating a job). For example, Repository’s define findConnector, findJob, createJob, etc.

Here’s what we did to get the PostgreSQL repository working.

Implementation

The PostgreSQL repository provides an implementation of RepositoryHandler called PostgresqlRepositoryHandler and that’s it. The majority of higher level work is done by the Repository, which is the JdbcRepository. Also, most of the code written for the DerbyRepositoryHandler is completely re-usable because they provide very similar implementations of SQL. The common code was pulled out into a separate module and a CommonRepositoryHandler was created. The PostgresqlRepositoryHandler just extends the CommonRepositoryHandler.

Testing

Most of the effort in writing a PostgreSQL repository implementation went into testing it. There are currently 2 kinds of tests:

  1. Unit tests
  2. Integration tests

The PostgreSQL repository ONLY implements integration tests and several of them. These tests iterate over every method in the handler in order to make sure data is written correctly. Also, for the first time in Sqoop2, there are structural tests for a repository. The actual schema structure of the database is tested and verified so that whenever a schema change is made, developers have an awareness.

Difficulties

There were several small details that were hard to get right initially. here’s a quick list of them:

  • Foreign keys and constraints were difficult to get 100% correct since there are so many and they need to line up with the Apache Derby implementation
  • Quoting is a real pain in relational databases. PostgreSQL and Apache Derby have their quoting preferences.
  • Integration tests had a circular dependency: sqoop-repository-postgresql relied on sqoop-test which relied on sqoop-repository-postgresql. In order to get around this, a sqoop-common-test packages was created.

To summarize…

The PostgreSQL repository is ready for use and I highly suggest it. For instructions on how to get started, check out Cloudera’s documentation.

Feel free to reach out to team@ingest.tips if you have any thoughts, questions, or concerns. Or just leave a comment!

Tweet about this on TwitterShare on FacebookShare on LinkedIn


'PostgreSQL repository added to Sqoop2' has no comments

Be the first to comment this post!

Would you like to share your thoughts?

Your email address will not be published.