Pivotal Greenplum-Spark Connector 1.2.0 Release Notes

The Pivotal Greenplum-Spark Connector supports high speed, parallel data transfer from Greenplum Database to an Apache Spark cluster.

Pivotal Greenplum-Spark Connector 1.2.0 is a minor release of the Greenplum Database connector for Apache Spark. This release includes new features and improvements.

Scope

The Greenplum-Spark Connector supports loading Greenplum Database table data into Spark using:

  • Spark’s Scala API - programmatic access (including the spark-shell REPL)

Supported Platforms

The following table identifies the supported component versions for Pivotal Greenplum-Spark Connector 1.2.0:

Greenplum-Spark Connector Version | Greenplum Version | Spark Version | Scala Version
----------------------------------|-------------------|---------------|--------------
1.2.0                             | 4.3.x, 5.x        | 2.1.1         | 2.11
1.1.0                             | 4.3.x, 5.x        | 2.1.1         | 2.11
1.0.0                             | 4.3.x, 5.x        | 2.1.1         | 2.11

Refer to the Pivotal Greenplum Database documentation for detailed information on Pivotal Greenplum Database.

See the Apache Spark documentation for information on Apache Spark version 2.1.1.

New Features

Pivotal Greenplum-Spark Connector 1.2.0 includes the following new features:

  • Filter Predicate Pushdown

    The Greenplum-Spark Connector now supports filter pushdown when reading from Greenplum Database into Spark. Filters are applied by Greenplum Database, and the Connector transfers only the filtered table data to Spark.
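    A minimal sketch of pushdown in the Spark Scala API (the host, database, table, and column names here are hypothetical, and the option keys shown follow the Connector's read options; see Connector Read Options for the exact keys):

    ```scala
    // Read a Greenplum Database table through the Connector.
    val gpdf = spark.read
      .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
      .option("url", "jdbc:postgresql://gpmaster.example.com:5432/testdb") // hypothetical host/db
      .option("dbtable", "otp_c")                                          // hypothetical table
      .option("user", "gpadmin")
      .option("password", "changeme")
      .option("partitionColumn", "airlineid")
      .load()

    // The predicate below is pushed down to Greenplum Database and
    // evaluated there; only the matching rows are transferred to Spark.
    val delayed = gpdf.filter("depdelayminutes > 60")
    ```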

  • User-Specified Schema

    The Greenplum-Spark Connector now exposes a schema option to identify the location of the Greenplum Database table. The table need no longer reside in a schema in your search_path.
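    A sketch of reading from a table outside the search_path, assuming the schema option key is dbschema (names are illustrative; see Connector Read Options for the exact key):

    ```scala
    // Read a table that lives in the "faa" schema rather than in a
    // schema on the user's search_path.
    val df = spark.read
      .format("greenplum")
      .option("url", "jdbc:postgresql://gpmaster.example.com:5432/testdb") // hypothetical host/db
      .option("dbschema", "faa")           // user-specified schema (assumed key name)
      .option("dbtable", "otp_c")          // hypothetical table
      .option("user", "user2")
      .option("password", "changeme")
      .option("partitionColumn", "airlineid")
      .load()
    ```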

  • Custom JDBC Driver

    You can now use a custom JDBC driver with the Greenplum-Spark Connector. Refer to Constructing the Greenplum Database JDBC URL.

  • JDBC Connection Pooling

    The Greenplum-Spark Connector now uses JDBC connection pooling internally to optimize connection re-use.

Changes

Pivotal Greenplum-Spark Connector 1.2.0 includes the following changes:

  • Data Source Short Name

    The Greenplum-Spark Connector now exposes the data source short name greenplum for reading data from Greenplum Database. Using the Connector's fully-qualified data source class name is deprecated. Refer to Greenplum-Spark Connector Data Source for additional information.
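    For example (connection details here are hypothetical):

    ```scala
    // As of 1.2.0, identify the data source by its short name.
    // The fully-qualified class name still works but is deprecated:
    //   "io.pivotal.greenplum.spark.GreenplumRelationProvider"
    val df = spark.read
      .format("greenplum")
      .option("url", "jdbc:postgresql://gpmaster.example.com:5432/testdb") // hypothetical host/db
      .option("dbtable", "otp_c")                                          // hypothetical table
      .option("user", "gpadmin")
      .option("password", "changeme")
      .option("partitionColumn", "airlineid")
      .load()
    ```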

  • Location of External Table Creation

    When you provide a user-specified schema, the Greenplum-Spark Connector now creates external tables in that schema rather than the public schema.

  • Port Usage

    In previous releases, the Greenplum-Spark Connector used multiple TCP ports in the range 49152-65535 for transferring data from Greenplum Database segment hosts to Spark worker nodes. The Greenplum-Spark Connector now uses a single port for data transfer and defers port assignment to the operating system unless you specifically configure the port number that you want the Connector to use. Refer to Network Port Requirements for more information about Greenplum-Spark Connector port requirements and configuration.

  • Removed Greenplum Database SUPERUSER Requirement

    The Greenplum-Spark Connector no longer requires SUPERUSER privileges for the Greenplum Database user specified in the JDBC login credentials.

  • Connector password Key Now Optional

    The GreenplumRelationProvider password connection key is now optional. You can omit the password key if Greenplum Database is configured not to require a password for the specified user, or if you use Kerberos authentication and provide the required authentication properties in the JDBC connection string URL. See Connector Read Options.
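    A sketch of a read with the password key omitted. The Kerberos property shown in the URL is illustrative; consult the documentation for your JDBC driver for the exact Kerberos/GSS property names it expects:

    ```scala
    // The Greenplum Database user "krbuser" is assumed to authenticate
    // via Kerberos, so no "password" option is supplied; the required
    // authentication properties ride along in the JDBC URL instead.
    val df = spark.read
      .format("greenplum")
      .option("url",
        "jdbc:postgresql://gpmaster.example.com:5432/testdb?kerberosServerName=postgres") // hypothetical
      .option("dbtable", "otp_c")          // hypothetical table
      .option("user", "krbuser")
      .option("partitionColumn", "airlineid")
      // no "password" option
      .load()
    ```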

Resolved Issues

The following issues were resolved in Pivotal Greenplum-Spark Connector 1.2.0:

Bug Id    | Summary
----------|--------
154978014 | If a Greenplum Database table contained a column of time or timestamp type and one of the column's values specified fractional seconds (for example, 10:36:54.137), the Greenplum-Spark Connector would issue a warning similar to: java.lang.NumberFormatException: For input string: "43.553". This problem has been resolved.

Known Issues and Limitations

Known issues and limitations related to the 1.2.0 release of the Pivotal Greenplum-Spark Connector include the following:

  • The Connector does not yet support writing Spark data back into Greenplum Database.
  • The Greenplum-Spark Connector supports basic data types such as Float, Integer, String, and Date/Time. The Connector does not yet support more complex types. See Greenplum Database <-> Spark Data Type Mapping for additional information.