Using the Greenplum-Spark Connector


Before using the Greenplum-Spark Connector, ensure that you can identify:

  • The hostname of your Greenplum Database master node.
  • The port on which your Greenplum Database master server process is running, if it is not running on the default port (5432).
  • The name of the Greenplum database to which you want to connect.
  • The name of the Greenplum Database table you want to access.
  • The Greenplum Database user/role name and password that you have been assigned. This role must have Greenplum Database SUPERUSER administrative privileges.

Downloading the Connector JAR File

The Greenplum-Spark Connector is available as a separate download for Greenplum Database 4.3.X from Pivotal Network:

  1. Download the JAR file: navigate to Pivotal Network, then locate and select the Release Download directory named Pivotal Greenplum Connector.

    The format of the Greenplum-Spark Connector JAR file name is greenplum-spark_<spark-version>-<gsc-version>.jar.

  2. Make note of the directory to which the JAR was downloaded.

Using spark-shell

You can run Spark interactively through spark-shell, a modified version of the Scala shell. Refer to the spark-shell Spark documentation for detailed information on using this command.

To try out the Greenplum-Spark Connector, run the spark-shell command providing a --jars option that identifies the file system path to the Greenplum-Spark Connector JAR file. For example:

spark-user@spark-node$ export GSC_JAR=/path/to/greenplum-spark_<spark-version>-<gsc-version>.jar
spark-user@spark-node$ spark-shell --jars $GSC_JAR
< ... spark-shell startup output messages ... >

When you run spark-shell, you enter the scala> interactive subsystem. A SparkSession is instantiated for you and accessible via the spark local variable:

scala> println(spark)

Your SparkSession provides the entry point to the method that you will use to transfer data from a Greenplum Database table into Spark.
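As an illustrative sketch only: assuming the Connector registers a Spark data source named "greenplum" and accepts url, dbtable, user, and password options (the option keys and every value below are placeholders, not confirmed API; consult the Connector reference for the exact names), loading a Greenplum Database table into a Spark DataFrame from the scala> prompt might look like:

```scala
scala> // All option keys and values below are illustrative placeholders.
scala> val gpdf = spark.read.format("greenplum").
     |   option("url", "jdbc:postgresql://gpmaster.domain:5432/testdb").
     |   option("dbtable", "my_table").
     |   option("user", "my_role").
     |   option("password", "changeme").
     |   load()
```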

Constructing the Greenplum Database JDBC URL

The Greenplum-Spark Connector uses a JDBC connection to communicate with the Greenplum Database master node. The PostgreSQL JDBC driver JAR file is bundled with the Greenplum-Spark Connector JAR file, so you do not need to manage this dependency.

You must provide a JDBC connection URL when you use the Connector to transfer data between Greenplum Database and Spark. The Greenplum-Spark Connector JDBC connection URL format is:

jdbc:postgresql://<master>[:<port>]/<database_name>

Parameter Name    Description
<master>          The hostname or IP address of the Greenplum Database master node.
<port>            The port on which the Greenplum Database server process is listening. Optional; the default is 5432.
<database_name>   The name of the Greenplum database to which you want to connect.

For example, using a hypothetical master host and database name:

jdbc:postgresql://gpmaster.domain:5432/testdb
The syntax and semantics of the JDBC connection string URL are governed by the PostgreSQL JDBC driver. For additional information about this syntax, refer to Connecting to the Database in the PostgreSQL JDBC documentation.

Note: Even though PostgreSQL supports specifying the user name and password in the JDBC connection string, the Greenplum-Spark Connector requires that these connection options be provided separately.
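To make the URL format concrete, here is a small Scala helper that assembles the connection URL from its parts. buildGpdbUrl is an illustrative helper for this guide only; it is not part of the Connector API, and the host and database names are placeholders:

```scala
// Assemble a Greenplum JDBC connection URL from its parts.
// buildGpdbUrl is illustrative only; it is not part of the Connector API.
def buildGpdbUrl(master: String, database: String, port: Int = 5432): String =
  s"jdbc:postgresql://$master:$port/$database"

// Placeholder host and database names:
val url = buildGpdbUrl("gpmaster.domain", "testdb")
// url == "jdbc:postgresql://gpmaster.domain:5432/testdb"
```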

Developing Applications with the Connector

If you are writing a stand-alone Spark application, bundle the Greenplum-Spark Connector along with your other application dependencies into an “uber” JAR. The Spark Self-Contained Applications and Bundling Your Application’s Dependencies documentation identifies additional considerations for stand-alone Spark application development.

You can use the spark-submit command to launch a Spark application assembled with the Greenplum-Spark Connector. Alternatively, run spark-submit with a --jars option that identifies the file system path to the Connector JAR file. The spark-submit Spark documentation describes this command in detail.
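For example, assuming your application is packaged as my-app.jar with main class com.example.MyGreenplumApp (both names are placeholders for your own application):

```shell
spark-user@spark-node$ export GSC_JAR=/path/to/greenplum-spark_<spark-version>-<gsc-version>.jar
spark-user@spark-node$ spark-submit --class com.example.MyGreenplumApp --jars $GSC_JAR my-app.jar
```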