(The following is a re-post from apache.org)
Sqoop 1.4.2 was released in August 2012. As this was an extremely important release for the Sqoop community – our first release as an Apache Top Level project – I would like to highlight the key features and fixes of this release. The entire change log can be viewed on our JIRA and actual bits can be downloaded from the usual place.
One of the main goals of the previous 1.4.1-incubating release was to make sure that Sqoop works across Hadoop versions rather than being locked to one specific release. We made sure that Sqoop works on Hadoop 0.20, 1.0, and 0.23. We're excited to announce that Sqoop 1.4.2 adds support for Hadoop 2.0.x (SQOOP-574), so Sqoop now works on four different major Hadoop releases!
Even though we made sure that the same codebase will work across all different Hadoop versions, we couldn’t ensure the same for the generated binary artifacts. Different Hadoop versions are not binary compatible with each other and there is nothing that Sqoop can do about that. You need to make sure that the Sqoop version that you’re using has been compiled against the appropriate Hadoop version that your cluster is using. To help users pick the correct Sqoop binary, all artifacts in the release have a suffix that encodes the supported Hadoop version.
One of our goals during incubation was to move the entire code base from the com.cloudera.sqoop namespace to the org.apache.sqoop namespace. Done naively, that change would have broken backward compatibility for connectors compiled against Sqoop 1.3.x. Incompatibility was a showstopper for us, so we developed a solution that preserves the old namespace for compatibility. Unfortunately, during the code migration we overlooked one issue that affected seamless usage of the Microsoft, Oracle, and other specialized connectors. With that resolved in this release, there are no outstanding compatibility issues: all specialized connectors built against Sqoop 1.3.x should work on 1.4.2 without any issues or required modifications.
One of Sqoop's many interesting and very useful features is incremental import: on each run, only the delta of new data is imported into Hadoop (HDFS, Hive, HBase). We've extended this feature to support free-form queries as well (SQOOP-444).
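As a rough sketch of how these two features combine, the hypothetical invocation below runs an incremental append import driven by a free-form query. The JDBC URL, credentials, table, and column names are all illustrative placeholders, not values from the release notes.

```shell
# Hypothetical example: incremental append import from a free-form query
# (SQOOP-444). Connection string, user, and columns are placeholders.
# With --query, Sqoop requires the $CONDITIONS placeholder plus explicit
# --split-by and --target-dir options.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user -P \
  --query 'SELECT id, amount, created FROM orders WHERE $CONDITIONS' \
  --split-by id \
  --target-dir /user/sqoop_user/orders \
  --incremental append \
  --check-column id \
  --last-value 42000
```

On the next run, passing the highest `id` seen so far as `--last-value` imports only rows added since the previous import.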
The logic that drives connector selection in Sqoop was fragile and very hard to predict. Sqoop supports the --connection-manager and --driver parameters, which are supposed to provide a reliable way to specify which connector and JDBC driver will be picked up. Unfortunately, specifying the --driver option caused the built-in selection logic to default to the generic JDBC connector, ignoring both the --connection-manager option and any installed third-party connectors. To address these nasty issues we've improved the connector selection process (SQOOP-529), so the entire logic is now far more predictable and reliable.
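For illustration, a hypothetical command that pins both options explicitly might look like the following; the connection string and the Oracle class names are assumptions for the example, not part of the release announcement.

```shell
# Hypothetical example: specifying both the connection manager and the
# JDBC driver explicitly, so the improved selection logic (SQOOP-529)
# does not have to guess. URL and class names are illustrative.
sqoop import \
  --connect jdbc:oracle:thin:@db.example.com:1521:ORCL \
  --connection-manager org.apache.sqoop.manager.OracleManager \
  --driver oracle.jdbc.OracleDriver \
  --table EMPLOYEES \
  --username scott -P
```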
The export feature allows you to move large data volumes from the Hadoop ecosystem (HDFS, Hive) to relational database systems and data warehouses for additional processing or reporting. Until this release, the table Sqoop exported into had to contain exactly the same columns as the exported data. Because this is highly undesirable in some use cases, we've made sure (SQOOP-503) that the --columns parameter, which already worked for import, also works for export. Exporting data to only a subset of columns does place an additional requirement on the target table, however: every column not in the list must either have a default value or be nullable. The export job will fail otherwise.
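A minimal sketch of a subset-column export, with a placeholder connection string and table:

```shell
# Hypothetical example: exporting only two of the target table's columns
# (SQOOP-503). Any column of "orders" not listed here must be nullable
# or have a default value, or the export job will fail.
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user -P \
  --table orders \
  --columns "id,amount" \
  --export-dir /user/sqoop_user/orders
```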
Verbose mode wasn't working correctly in Sqoop 1.4.x due to the aforementioned namespace migration; we've made sure to fix that (SQOOP-460, SQOOP-488). In addition, we've extended the reach of the --verbose parameter to the generated MapReduce jobs (SQOOP-436), so Sqoop now provides much better logging, when needed, across the entire import or export process!
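For completeness, turning on the extended logging is just a matter of adding the flag to any job; connection details below are placeholders.

```shell
# Hypothetical example: --verbose now also raises the log level inside
# the generated MapReduce job (SQOOP-436), not just in the client.
sqoop import --verbose \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user -P \
  --table orders
```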
The Sqoop team has improved Hive imports in several ways. First, the user no longer needs to remove the empty temporary directory left behind after Sqoop moves data into Hive (SQOOP-443); as a result, multiple Hive imports can now reuse the same temporary directory without extra manual steps. Until this release, each Hive import staged its data in a temporary directory under the user's home directory, named after the imported table. The second notable improvement is support for arbitrary temporary directories during a Hive import (SQOOP-483).
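A sketch of a Hive import staging through an explicitly chosen directory; the connection string, table, and staging path are assumptions for the example.

```shell
# Hypothetical example: Hive import using an explicit staging directory
# (SQOOP-483) instead of the default <home dir>/<table name> location.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user -P \
  --table orders \
  --hive-import \
  --target-dir /tmp/sqoop-staging/orders
```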
I believe that 1.4.2 is a great release with many improvements and interesting new features. All users are highly encouraged to upgrade to this latest release.