Using the Greenplum-Spark Connector
Before using the Greenplum-Spark Connector, ensure that you can identify:
- The hostname of your Greenplum Database master node.
- The port on which your Greenplum Database master server process is running, if it is not running on the default port (5432).
- The name of the Greenplum database to which you want to connect.
- The names of the Greenplum Database schema and table that you want to access.
- The Greenplum Database user/role name and password that you have been assigned. Also ensure that this user/role name has the required privileges as described in Configuring Greenplum Database Role Privileges.
The Greenplum-Spark Connector is available as a separate download for Greenplum Database 4.3.X or 5.X from Pivotal Network:
Download the JAR file by navigating to Pivotal Network and locating and selecting the Release Download directory named Pivotal Greenplum Connector.
The format of the Greenplum-Spark Connector JAR file name is
greenplum-spark_<spark-version>-<gsc-version>.jar. For example:
Make note of the directory to which the JAR was downloaded.
You can run Spark interactively through
spark-shell, a modified version of the Scala shell. Refer to the spark-shell Spark documentation for detailed information on using this command.
To try out the Greenplum-Spark Connector, run the
spark-shell command providing a
--jars option that identifies the file system path to the Greenplum-Spark Connector JAR file. For example:
spark-user@spark-node$ export GSC_JAR=/path/to/greenplum-spark_<spark-version>-<version>.jar spark-user@spark-node$ spark-shell --jars $GSC_JAR < ... spark-shell startup output messages ... > scala>
When you run
spark-shell, you enter the
scala> interactive subsystem. A
SparkSession is instantiated for you and accessible via the
spark local variable:
scala> println(spark) org.apache.spark.sql.SparkSession@4113d9ab
SparkSession provides the entry point to the
spark.read.format().load() method that you will use to transfer data from a Greenplum Database table into Spark.
If you are writing a stand-alone Spark application, you will bundle the Greenplum-Spark Connector along with your other application dependencies into an “uber” JAR. The Spark Self-Contained Applications and Bundling Your Application’s Dependencies documentation identifies additional considerations for stand-alone Spark application development.
You can use the
spark-submit command to launch a Spark application assembled with the Greenplum-Spark Connector. You can also run the
spark-submit command providing a
--jars option that identifies the file system path to the Greenplum-Spark Connector JAR file. The spark-submit Spark documentation describes using this command.
The Greenplum-Spark Connector uses a JDBC connection to communicate with the Greenplum Database master node. The PostgreSQL JDBC driver JAR file is bundled with the Greenplum-Spark Connector JAR file, so you do not need to manage this dependency. You may also use a custom JDBC driver with the Greenplum-Spark Connector.
You must provide a JDBC connection string URL when you use the Connector to transfer data between Greenplum Database and Spark. This URL must include the Greenplum Database master hostname and port, as well as the name of the database to which you want to connect.
|<master>||Hostname or IP address of the Greenplum Database master node.|
|<port>||The port on which the Greenplum Database server process is listening. Optional, default is 5432.|
|<database_name>||The Greenplum database to which you want to connect.|
Note: The Greenplum-Spark Connector requires that other connection options, including user name and password, be provided separately.
The JDBC connection string URL format for the default Greenplum-Spark Connector JDBC driver is:
The syntax and semantics of the default JDBC connection string URL are governed by the PostgreSQL JDBC driver. For additional information about this syntax, refer to Connecting to the Database in the PostgreSQL JDBC documentation.
The Greenplum-Spark Connector also supports using a custom JDBC driver. To use a custom Greenplum Database JDBC driver, you must:
Construct a JDBC connection string URL for your custom driver that includes the Greenplum Database master hostname and port and the name of the database to which you want to connect.
Provide the JAR file for the custom JDBC driver via one of the following options:
- Include a
--jars <custom-jdbc-driver>.jaroption on your
spark-submitcommand line, identifying the full path to the custom JDBC driver JAR file.
- Bundle the Greenplum-Spark Connector and custom JDBC JAR files along with your other application dependencies into an “uber” JAR.
- Install the custom JDBC JAR file in a known, configured location on your Spark executor nodes.
- Include a
You must also identify the fully qualified Java class name of the JDBC driver in a Greenplum-Spark Connector option (described in Connector Read Options).
For example, to use the Greenplum-Spark Connector with the Greenplum Database Data Direct JDBC driver that you have downloaded to
/tmp/greenplum.jar in a
spark-shell session, use this connection string URL format:
spark-shell with the following command line:
$ spark-shell --jars /tmp/greenplum.jar --jars $GSC_JAR
And specify the class name
com.pivotal.jdbc.GreenplumDriver in the Greenplum-Spark Connector JDBC
driver option value.
The Greenplum-Spark Connector pools JDBC connections for each Spark application. The Connector creates a new connection pool for each unique combination of JDBC connection string URL, username, and password.
You can use Greenplum-Spark Connector options to configure the size of the connection pool (
pool.maxSize), the amount of time after which an inactive connection is considered idle (
pool.timeoutMs), and the minimum number of idle connections allowed in the pool (
pool.minIdle). These connection pool options bound the number of open connections between a Spark application and the Greenplum Database server.
Setting connection pool options in your Spark application is described in Specifying Connection Pool Options.
Take into consideration both the performance required for your Spark application and the desired resource impact on the Greenplum Database cluster when you change connection pool configuration:
Decreasing the maximum size of the connection pool bounds the number of open connections to the Greenplum Database server. Setting this option too low may decrease the parallelism of your Spark application.
The default minimum number of idle connections in the pool is zero (0). If you increase this value, be aware that the Connector maintains at least that number of open connections, and that some or all of the connections may be idle.
If you decrease the idle timeout for connections in the pool, the Connector closes idle connections sooner. Very short timeout values may defeat the purpose of connection pooling.
The Greenplum Database
max_connections server configuration parameter identifies the maximum number of concurrent open connections that are permitted to the database server. Each running Spark application that uses the Greenplum-Spark Connector owns a number of open connections on the Greenplum Database server. Greenplum Database Connection Errors provides troubleshooting information should you encounter Greenplum Database connection errors in your Spark application.
By default, the Greenplum-Spark Connector selects a random port for data transfer between Greenplum Database and Spark worker nodes. If you choose, you can specify the port number that the Connector uses for data transfer via an option named
Setting and using the
connector.port option in your Spark application is described in Specifying the connector.port Option.
connector.port to a single port number or to an environment variable name of your choosing, where the environment variable value is set to a single port number.
The table below describes the actions of the Greenplum-Spark Connector for each type of
|Single port number||The Greenplum-Spark Connector attempts to open/use the specified port number on each Spark worker.|
|Environment variable name||The Greenplum-Spark Connector reads the value of the environment variable and attempts to open the port number. When you use an environment variable to set the port number, you can configure a different port number for each Spark worker node.|
If you choose to specify the port number via an environment variable, set the environment variable on each Spark worker before running the
start-slave.sh command on that node. For example, if the environment variable is named
user@spark-worker$ GSC_EXTERNAL_PORT="12900" start-slave.sh
You may also choose to set an environment variable in your
spark-env.sh file. For information about the Spark
spark-env.sh file, refer to the Environment Variables section of the Spark Configuration documentation.