Pivotal Greenplum-Spark Connector 2.x Release Notes

The Pivotal Greenplum-Spark Connector provides high-speed, parallel data transfer between Greenplum Database and an Apache Spark cluster using Spark’s Scala API for programmatic access (including the spark-shell REPL).

Refer to the Pivotal Greenplum Database documentation for detailed information about Pivotal Greenplum Database.

See the Apache Spark documentation for information about Apache Spark version 2.4.
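
As a quick orientation, the sketch below shows the kind of spark-shell session the Connector enables. All host, database, table, and credential values are hypothetical placeholders; the sketch assumes the Connector JAR is on the Spark classpath and that the Connector registers a greenplum data source format with url, dbschema, dbtable, user, and password options, as described in the Connector documentation:

```scala
// Start the REPL with the Connector JAR, for example:
//   spark-shell --jars greenplum-connector-apache-spark-scala_2.12-2.1.0.jar
// All connection values below are hypothetical placeholders.
val gpdf = spark.read.format("greenplum")
  .option("url", "jdbc:postgresql://gpmaster.example.com:5432/testdb")  // Greenplum master (hypothetical)
  .option("dbschema", "public")    // schema containing the Greenplum table
  .option("dbtable", "sales")      // Greenplum table to read (hypothetical)
  .option("user", "gpadmin")
  .option("password", "changeme")
  .load()

gpdf.printSchema()   // inspect the Spark schema mapped from the Greenplum table
```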

Supported Platforms

The following table identifies the supported component versions for the Pivotal Greenplum-Spark Connector 2.x:

Greenplum-Spark Connector Version   Greenplum Version   Spark Version   Scala Version   PostgreSQL JDBC Driver Version
2.1, 2.0                            5.x, 6.x            2.3.x, 2.4.x    2.11            42.2.14
2.1, 2.0                            5.x, 6.x            2.4.x, 3.0.x    2.12            42.2.14

The Greenplum-Spark Connector is certified against the Greenplum, Spark, and Scala versions listed above. The Connector is bundled with, and certified against, the listed PostgreSQL JDBC driver version.

Release 2.1

Released: November 24, 2020

Greenplum-Spark Connector 2.1.0 includes new and changed features and bug fixes.

New and Changed Features

Pivotal Greenplum-Spark Connector 2.1.0 includes this new and changed feature:

The Greenplum-Spark Connector now uses external temporary tables when it transfers data between Greenplum Database and Spark. Benefits include the following:

  • Greenplum Database external temporary tables are created and reside in their own schema; the Greenplum user reading the data is no longer required to have CREATE privileges on the schema in which the accessed Greenplum table resides.
  • Greenplum Database removes external temporary tables when the session is over; manual clean-up of orphaned external tables is no longer required. (Cleaning Up Orphaned Greenplum External Tables in previous versions of the documentation describes this now-unnecessary procedure.)
  • The Connector reuses external temporary tables; it creates fewer tables and has less of an impact on Greenplum Database catalog bloat.

Resolved Issues

The following issues were resolved in Greenplum-Spark Connector version 2.1.0:

  • Bug 31083 — Resolved an issue where the Connector failed to read data from Greenplum Database when the partitionColumn was gp_segment_id and mirroring was enabled in the Greenplum cluster.
  • Bug 31075 — The developer had no way to specify the schema in which the Greenplum-Spark Connector created its external tables; the Connector always created external tables in the same schema as the Greenplum table. An undesirable side effect of this behavior was that the Greenplum user reading a table was required to have CREATE privilege on the schema in which the table resided. This issue is resolved; the Connector now uses external temporary tables when it accesses Greenplum tables, and these temporary tables reside in a separate, dedicated Greenplum Database schema.

Release 2.0

Released: September 30, 2020

Greenplum-Spark Connector 2.0.0 includes new and changed features and bug fixes.

New and Changed Features

Pivotal Greenplum-Spark Connector 2.0.0 includes these new and changed features:

  • The Greenplum-Spark Connector is certified against the Scala, Spark, and JDBC driver versions identified in Supported Platforms above.
  • The Greenplum-Spark Connector is now bundled with the PostgreSQL JDBC driver version 42.2.14.
  • The Greenplum-Spark Connector package that you download from Pivotal Network is now a .tar.gz file that includes the product open source license and the Connector JAR file. The naming format of the file is greenplum-connector-apache-spark-scala_<scala-version>-<gsc-version>.tar.gz.

    For example:

    • greenplum-connector-apache-spark-scala_2.11-2.0.0.tar.gz
    • greenplum-connector-apache-spark-scala_2.12-2.0.0.tar.gz
  • The default gpfdist server connection activity timeout has increased from 30 seconds to 5 minutes.

  • A new server.timeout option lets a developer specify the gpfdist server connection activity timeout.

  • The Connector improves read performance from Greenplum Database by using the internal Greenplum table column named gp_segment_id as the default partitionColumn when the developer does not specify this option.
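
A hedged sketch of how these read-side changes surface in code follows. All connection values are hypothetical placeholders, and the sketch assumes server.timeout takes a milliseconds value:

```scala
// Hypothetical connection values; option names mirror those described above.
val opts = Map(
  "url"      -> "jdbc:postgresql://gpmaster.example.com:5432/testdb",
  "dbschema" -> "public",
  "dbtable"  -> "sales",
  "user"     -> "gpadmin",
  "password" -> "changeme",
  // Optional: if omitted, the Connector now defaults partitionColumn to gp_segment_id.
  "partitionColumn" -> "sale_id",
  // Optional: gpfdist connection activity timeout; the default is now 5 minutes.
  // Assumes the value is specified in milliseconds.
  "server.timeout"  -> "300000"
)

val df = spark.read.format("greenplum").options(opts).load()
```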

Resolved Issues

The following issues were resolved in Greenplum-Spark Connector version 2.0.0:

  • Bug 30731 — Resolved an issue where the Greenplum-Spark Connector timed out with a serialization exception when writing aggregated results to Greenplum Database. The Connector now exposes the server.timeout option to specify the gpfdist “no activity” timeout, and sets the default timeout to 5 minutes.
  • Bug 174495848 — Resolved an issue where predicate pushdown was not working correctly because the Greenplum-Spark Connector did not use parentheses to join the predicates together when it constructed the filter string.
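
For the write path that bug 30731 concerned, a sketch like the following (all table and connection names hypothetical) aggregates in Spark and writes the result back to Greenplum, raising server.timeout for a long-running job; it assumes the value is in milliseconds and that the Connector honors the standard Spark save modes:

```scala
// All connection values are hypothetical placeholders.
val gpOptions = Map(
  "url"      -> "jdbc:postgresql://gpmaster.example.com:5432/testdb",
  "dbschema" -> "public",
  "user"     -> "gpadmin",
  "password" -> "changeme"
)

// Aggregate in Spark, then write the result back to Greenplum;
// a larger server.timeout keeps gpfdist from timing out mid-write.
val totals = spark.read.format("greenplum")
  .options(gpOptions + ("dbtable" -> "sales"))
  .load()
  .groupBy("region")
  .count()

totals.write.format("greenplum")
  .options(gpOptions + ("dbtable" -> "sales_totals", "server.timeout" -> "600000"))
  .mode("append")
  .save()
```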

Removed Features

The Greenplum-Spark Connector version 2.x removes:

  • Support for Greenplum Database 4.x.
  • The connector.port option (deprecated in 1.6).
  • The partitionsPerSegment option (deprecated in 1.5).

Known Issues and Limitations

Known issues and limitations related to the 2.x release of the Pivotal Greenplum-Spark Connector include the following:

  • (Resolved in 2.1.0) The Connector cannot use gp_segment_id as the partitionColumn (the default) when reading data from Greenplum Database and mirroring is enabled in the Greenplum cluster.
  • The Connector does not support reading from or writing to Greenplum Database when your Spark cluster is deployed on Kubernetes.
  • The Connector supports basic data types such as Float, Integer, String, and Date/Time. It does not yet support more complex types. See Greenplum Database <-> Spark Data Type Mapping for additional information.