Pivotal Greenplum®-Spark® Connector v1.6

Pivotal Greenplum-Spark Connector 1.6.0 Release Notes

The Pivotal Greenplum-Spark Connector supports high speed, parallel data transfer between Greenplum Database and an Apache Spark cluster using:

  • Spark’s Scala API - programmatic access (including the spark-shell REPL)

Pivotal Greenplum-Spark Connector 1.6.0 is a minor release of the Greenplum Database connector for Apache Spark. This release includes new and changed features and bug fixes.

Supported Platforms

The following table identifies the supported component versions for Pivotal Greenplum-Spark Connector 1.6.0:

Greenplum-Spark Connector Version   Greenplum Version   Spark Version     Scala Version
1.6.0                               4.3.x, 5.x          2.1.2 and above   2.11
1.5.0                               4.3.x, 5.x          2.1.2 and above   2.11
1.4.0                               4.3.x, 5.x          2.1.1             2.11
1.3.0                               4.3.x, 5.x          2.1.1             2.11
1.2.0                               4.3.x, 5.x          2.1.1             2.11
1.1.0                               4.3.x, 5.x          2.1.1             2.11
1.0.0                               4.3.x, 5.x          2.1.1             2.11

Refer to the Pivotal Greenplum Database documentation for detailed information about Pivotal Greenplum Database.

See the Apache Spark documentation for information about Apache Spark version 2.1.2.

New Features

Pivotal Greenplum-Spark Connector 1.6.0 includes the following new feature:

  • Finer-Grained Control Over the Connector Server Address

    The Greenplum-Spark Connector exposes new options to specify the gpfdist server process address on the Spark worker node. Refer to Configuring the Connector Server Address for additional information about these options.

Changed Features

Pivotal Greenplum-Spark Connector 1.6.0 includes the following changes:

  • connector.port Option is Replaced and Deprecated

    The connector.port option is deprecated, and the Greenplum-Spark Connector no longer uses it. The Connector now uses an option named server.port to identify the server port number.
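
An application that still sets connector.port in its option map needs to supply the port under server.port instead. The following is a minimal sketch of one way to do that rename in plain Scala. It is not part of the Connector: the object name PortOptionMigration and the migrate helper are hypothetical, and only the two option names come from these release notes.

```scala
// Hypothetical helper, not part of the Greenplum-Spark Connector.
object PortOptionMigration {
  // Option names taken from the release notes.
  val OldKey = "connector.port"
  val NewKey = "server.port"

  /** Returns a copy of `opts` in which a `connector.port` entry is
    * carried over to `server.port`. If both keys are present, the
    * explicit `server.port` value is kept and the old key is dropped.
    */
  def migrate(opts: Map[String, String]): Map[String, String] =
    opts.get(OldKey) match {
      // Old key only: move its value to the new key.
      case Some(port) if !opts.contains(NewKey) =>
        (opts - OldKey) + (NewKey -> port)
      // Both keys present: keep the new key, discard the old one.
      case Some(_) =>
        opts - OldKey
      // Nothing to migrate.
      case None =>
        opts
    }
}
```

For example, PortOptionMigration.migrate(Map("connector.port" -> "12900")) yields Map("server.port" -> "12900"), where 12900 is an arbitrary port chosen for illustration. The resulting map could then be passed to Spark's DataFrameReader.options along with the application's other Connector options.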

Resolved Issues

The following issues were resolved in Pivotal Greenplum-Spark Connector version 1.6.0:

Bug Id   Summary
29589    A read operation using the Greenplum-Spark Connector failed when the hosts in the Spark cluster were configured with multiple network interfaces. Greenplum Database was unable to access a gpfdist server process that the Connector started on an internal network interface. The Greenplum-Spark Connector now exposes options that a Spark application can use to explicitly specify the gpfdist server process hostname, IP address, or network interface on a Spark worker node.
29606    Due to a suboptimal table metadata query, the Greenplum-Spark Connector failed to read from a Greenplum Database view that contained more than ten thousand rows. This issue is resolved. The Connector now uses a different query to obtain Greenplum table metadata.

Known Issues and Limitations

Known issues and limitations related to the 1.6.0 release of the Pivotal Greenplum-Spark Connector include the following:

  • The Greenplum-Spark Connector supports basic data types such as Float, Integer, String, and Date/Time. The Connector does not yet support more complex types. See Greenplum Database <-> Spark Data Type Mapping for additional information.