Pivotal Greenplum-Spark Connector 1.3.0 Release Notes

The Pivotal Greenplum-Spark Connector supports high-speed, parallel data transfer from Greenplum Database to an Apache Spark cluster.

Pivotal Greenplum-Spark Connector 1.3.0 is a minor release of the Greenplum Database connector for Apache Spark. This release includes bug fixes, new features, and improvements.

Scope

The Greenplum-Spark Connector supports loading Greenplum Database table data into Spark using:

  • Spark’s Scala API - programmatic access (including the spark-shell REPL)
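
    For example, loading a Greenplum Database table into a Spark DataFrame from the spark-shell might look like the following sketch. The data source class name and the option keys (url, dbtable, user, password, partitionColumn) are illustrative assumptions; consult the Connector documentation for the exact values used by your installation.

        // Sketch: load a Greenplum Database table into a Spark DataFrame.
        // The format string and all option names/values below are assumptions.
        val gpdf = spark.read
          .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
          .option("url", "jdbc:postgresql://gpmaster.example.com:5432/testdb") // assumed JDBC URL
          .option("dbtable", "my_table")       // Greenplum table to load
          .option("user", "gpadmin")
          .option("password", "changeme")
          .option("partitionColumn", "id")     // column used to divide work among Spark workers
          .load()

        gpdf.printSchema()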

Supported Platforms

The following table identifies the supported component versions for Pivotal Greenplum-Spark Connector 1.3.0:

Greenplum-Spark Connector Version   Greenplum Version   Spark Version   Scala Version
1.3.0                               4.3.x, 5.x          2.1.1           2.11
1.2.0                               4.3.x, 5.x          2.1.1           2.11
1.1.0                               4.3.x, 5.x          2.1.1           2.11
1.0.0                               4.3.x, 5.x          2.1.1           2.11

Refer to the Pivotal Greenplum Database documentation for detailed information on Pivotal Greenplum Database.

See the Apache Spark documentation for information on Apache Spark version 2.1.1.

New Features

Pivotal Greenplum-Spark Connector 1.3.0 includes the following new features:

  • Connection Pool Configuration

    The Greenplum-Spark Connector pools JDBC connections for each Spark application. The Connector now provides configuration options to tune connection pool size and idle properties. Refer to JDBC Connection Pooling for additional information about this feature.
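
    As a sketch, the pool settings would be supplied alongside the other DataFrame options. The pool.* option keys and their values shown here are assumptions for illustration only; refer to the JDBC Connection Pooling documentation for the actual option names and defaults.

        // Sketch: tuning the Connector's JDBC connection pool via DataFrame options.
        // All option names and values below are illustrative assumptions.
        val gpdf = spark.read
          .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
          .option("url", "jdbc:postgresql://gpmaster.example.com:5432/testdb")
          .option("dbtable", "my_table")
          .option("user", "gpadmin")
          .option("password", "changeme")
          .option("partitionColumn", "id")
          .option("pool.maxSize", "10")        // assumed: maximum pooled connections
          .option("pool.timeoutMs", "10000")   // assumed: idle-connection timeout in milliseconds
          .option("pool.minIdle", "2")         // assumed: minimum idle connections retained
          .load()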

Changes

Pivotal Greenplum-Spark Connector 1.3.0 includes the following changes:

  • Data Assignment to Spark Workers

    The Greenplum-Spark Connector now uses Greenplum Database table statistics to partition table data among Spark worker nodes. With this scheme, the 1.3.0 Connector may assign table data to different Spark worker nodes than would be assigned by previous Connector versions.

  • Spark Worker Port Specification

    The Greenplum-Spark Connector now supports specifying a single gpfdist port number via a DataFrame option. In previous versions of the Connector, you set the GPFDIST_PORT environment variable to specify a single port or a list of port numbers. Refer to Network Port Requirements for more information about Greenplum-Spark Connector port requirements and configuration.
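
    A sketch of the new usage follows; the "server.port" option key is an assumption made for illustration, so check Network Port Requirements for the documented option name and valid values.

        // Sketch: pinning gpfdist to a single port via a DataFrame option.
        // The "server.port" key (and the other options) are illustrative assumptions.
        val gpdf = spark.read
          .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
          .option("url", "jdbc:postgresql://gpmaster.example.com:5432/testdb")
          .option("dbtable", "my_table")
          .option("user", "gpadmin")
          .option("password", "changeme")
          .option("partitionColumn", "id")
          .option("server.port", "12900")  // assumed option key for the single gpfdist port
          .load()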

Resolved Issues

The following issues were resolved in Pivotal Greenplum-Spark Connector 1.3.0:

Bug Id      Summary
155369957   The Greenplum-Spark Connector returned a java.time.format.DateTimeParseException when it parsed a timestamp format that included fractional seconds. The Connector now correctly parses such timestamp formats.
155799016   The Greenplum-Spark Connector returned a “connection limit exceeded” error when a request by a Spark application exceeded the maximum number of connections configured for the Greenplum Database server. To mitigate this error in cases where a Spark application using the Greenplum-Spark Connector is the culprit, the Connector now exposes configuration options to tune connection pool size and idle properties. Refer to JDBC Connection Pooling for information about connection pooling in the Greenplum-Spark Connector. For related troubleshooting information, see Greenplum Database Connection Errors.

Known Issues and Limitations

Known issues and limitations related to the 1.3.0 release of the Pivotal Greenplum-Spark Connector include the following:

  • The Connector does not yet support writing Spark data back into Greenplum Database.
  • The Greenplum-Spark Connector supports basic data types such as Float, Integer, String, and Date/Time. The Connector does not yet support more complex types. See Greenplum Database <-> Spark Data Type Mapping for additional information.