Pivotal Greenplum®-Spark® Connector v1.6

About Connector Options

You provide Greenplum Database connection options and the read/write options required by the Greenplum-Spark Connector via generic key-value String pairs.

Greenplum-Spark Connector Options

The Greenplum-Spark Connector supports the options identified below; the operations to which each option applies (Read, Write, or both) follow the option name. An option is required unless its description states that it is optional.

url (Read, Write)
    The JDBC connection string URL; see Constructing the Greenplum Database JDBC URL.

dbschema (Read, Write)
    The name of the Greenplum Database schema in which dbtable resides. This option also identifies the name of the schema in which the Greenplum-Spark Connector creates temporary external tables. Optional; the default is the schema named public.

dbtable (Read, Write)
    The name of the Greenplum Database table. When reading from Greenplum Database, this table or view must reside in the schema identified by the dbschema option. When writing from Spark to Greenplum Database, the Greenplum-Spark Connector creates the table in dbschema if it does not already exist.

driver (Read, Write)
    The fully qualified class path of the custom JDBC driver. Optional; specify only when using a custom JDBC driver.

user (Read, Write)
    The Greenplum Database user/role name.

password (Read, Write)
    The Greenplum Database password for the user. Optional; you can omit the password if Greenplum Database is configured to not require a password for the specified user, or if you use Kerberos authentication and provide the required authentication properties in the JDBC connection string URL.

partitionColumn (Read)
    The name of the Greenplum Database table column to use for Spark partitioning. This column must be of Greenplum Database data type integer, bigint, serial, or bigserial. The column name must be lowercase.

partitions (Read)
    The number of Spark partitions. Optional; the default value is the number of primary segments in the Greenplum Database cluster.

partitionsPerSegment (Read)
    The number of Spark partitions per Greenplum Database segment. The Connector ignores this option in favor of partitions. Note: The partitionsPerSegment option is deprecated as of version 1.5.0 and will be removed in a future release.

truncate (Write)
    The table overwrite mode. Governs the actions of the Connector when you specify SaveMode.Overwrite on a write operation and the target Greenplum Database table exists. Optional; the default value is false, in which case the Connector drops and then re-creates the target table before writing any data. When true, the Connector truncates the target table before writing any data. (See the write sketch following this list.)

connector.port (Read, Write)
    The gpfdist server process port number, or the name of an environment variable that identifies it, on the Spark worker node. The Connector ignores this option in favor of server.port. Note: The connector.port option is deprecated as of version 1.6.0 and will be removed in a future release.

server.port (Read, Write)
    The gpfdist server process port number, or the name of an environment variable that identifies it, on the Spark worker node. Optional; by default, the Connector uses a random port.

server.useHostname (Read, Write)
    Whether to use the Spark worker node host name to specify the gpfdist server address. Optional; the default value is false.

server.hostEnv (Read, Write)
    The name of the environment variable whose value identifies the Spark worker node host name or IP address on which to start the gpfdist server process. Optional.

server.nic (Read, Write)
    The name of the Spark worker node network interface on which to start the gpfdist server, or the name of an environment variable that identifies it. Optional.

pool.maxSize (Read, Write)
    The maximum number of connections in the connection pool. Optional; the default value is 64.

pool.timeoutMs (Read, Write)
    The amount of time, in milliseconds, after which an inactive connection is considered idle. Optional; the default value is 10,000 (10 seconds).

pool.minIdle (Read, Write)
    The minimum number of idle connections maintained in the connection pool. Optional; the default value is 0.
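
The write sketch referenced in the truncate description above follows; it is a minimal, hypothetical example in which the DataFrame dfToWrite and the connection values are illustrative, and the "greenplum" data source short name is assumed:

import org.apache.spark.sql.SaveMode

// Overwrite an existing Greenplum Database table, truncating it rather
// than dropping and re-creating it. dfToWrite is an existing DataFrame.
dfToWrite.write.format("greenplum")
      .option("url", "jdbc:postgresql://gpdb-master:5432/testdb")
      .option("user", "bill")
      .option("password", "changeme")
      .option("dbschema", "myschema")
      .option("dbtable", "table1")
      .option("truncate", "true")    // applies only with SaveMode.Overwrite
      .mode(SaveMode.Overwrite)
      .save()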

You can specify options to the Connector individually or in an options Map. The class methods of interest for setting Greenplum-Spark Connector options are:

.option(key: String, value: String)

for specifying an individual option, and:

.options(options: Map[String, String])

for specifying an options map.

To specify an option individually, provide the <option_key> and <value> strings to the DataFrameReader.option() or DataFrameWriter.option() method. For example, to specify the user option:

.option("user", "gpdb_role_name")

To construct a scala.collection.Map comprising more than one option, you provide the <option_key> and <value> strings for each option. For example, to create the options for a read operation:

val gscReadOptionMap = Map(
      "url" -> "jdbc:postgresql://gpdb-master:5432/testdb",
      "user" -> "bill",
      "password" -> "changeme",
      "dbschema" -> "myschema",
      "dbtable" -> "table1",
      "partitionColumn" -> "id"
)

To provide an options map to the Connector, specify it in the DataFrameReader.options() or DataFrameWriter.options() methods. For example, to provide the gscReadOptionMap map created above:

.options(gscReadOptionMap)
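
A complete read then looks like the following sketch, again assuming a SparkSession named spark and the "greenplum" data source short name:

// Load the Greenplum Database table identified in gscReadOptionMap
// into a Spark DataFrame.
val gpdf = spark.read.format("greenplum")
      .options(gscReadOptionMap)
      .load()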

Specifying Partition Options

Partition options apply only when you read data from Greenplum Database into Spark.

A Spark DataFrame is a fault-tolerant collection of elements partitioned across the Spark cluster nodes. Spark operates on these elements in parallel.

Greenplum Database distributes its table data across segments running on segment hosts.

The Greenplum-Spark Connector provides two options to configure the mapping between Spark partitions and Greenplum Database segment data: partitionColumn and partitions.

partitionColumn

The column that you specify with the partitionColumn option must be of Greenplum Database data type integer, bigint, serial, or bigserial. The partitionColumn need not be the column that you specified in the DISTRIBUTED BY (<column>) clause when you created the Greenplum Database table.

partitions

By default, the Greenplum-Spark Connector creates one Spark partition per Greenplum Database segment. You can set the partitions option to specify a larger number of Spark partitions.
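
For example, to request more Spark partitions than there are segments, add a partitions entry to the read options Map (the count 32 is illustrative):

"partitionColumn" -> "id",
"partitions" -> "32"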

Specifying the Connector Server Address Options

The Greenplum-Spark Connector uses the Greenplum Database gpfdist protocol to transfer data between Spark and Greenplum Database.

Use one or more of the following options to specify the address and port number of the gpfdist server running on a Spark worker node:

  • server.port
  • server.useHostname
  • server.hostEnv
  • server.nic

Refer to Configuring the Connector Server Address for additional information about these options.

server.port

You can specify either a specific port number or an environment variable name as the server.port option value. To use a specific port, provide the port number directly. For example, to set server.port as a single option:

.option("server.port", "12900")

Or, to set it within a Map of options:

"server.port" -> "12900"

If you choose to specify the port number with an environment variable, prefix the environment variable name with env.. For example, to identify the environment variable named GSC_EXTERNAL_PORT as the server.port option value:

"server.port" -> "env.GSC_EXTERNAL_PORT",

Note: Setting server.port to env.GPFDIST_PORT results in the same behavior as that of Greenplum-Spark Connector version 1.2.0.
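
How you surface such an environment variable to the Spark worker processes depends on your deployment; one possible sketch uses Spark's spark.executorEnv.[Name] configuration property (the variable name GSC_EXTERNAL_PORT and port 12900 are illustrative):

import org.apache.spark.sql.SparkSession

// Illustrative only: expose GSC_EXTERNAL_PORT to the executor processes,
// then reference it via the env. prefix in the server.port option value.
val spark = SparkSession.builder()
      .config("spark.executorEnv.GSC_EXTERNAL_PORT", "12900")
      .getOrCreate()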

server.useHostname

server.useHostname is a boolean option that determines whether the Connector uses the Spark worker node host name or its IP address as the gpfdist server address on that node. When set to true, the Greenplum-Spark Connector uses the Spark worker host name.

The default server.useHostname value is false. When false and neither server.hostEnv nor server.nic is set (the default), the Greenplum-Spark Connector uses the Spark worker node's IP address as the gpfdist server address.

To set server.useHostname within a Map of options:

"server.useHostname" -> "true"

server.hostEnv

server.hostEnv specifies the name of the environment variable whose value identifies the Spark worker node host name or IP address on which to start the gpfdist server. This environment variable typically identifies the same information as the Spark SPARK_LOCAL_IP environment variable.

The Connector examines the server.hostEnv option only when server.useHostname is set to false.

To set server.hostEnv within a Map of options:

"server.hostEnv" -> "env.MY_LOCAL_SPARK_IP"

server.nic

server.nic identifies the name of the Spark worker node network interface on which to start the gpfdist server, or the environment variable name that identifies such.

The Connector examines this option only when server.useHostname is false and no server.hostEnv is specified.

To set server.nic to a specific interface name (rather than an environment variable name) within a Map of options:

"server.nic" -> "eth1"

Specifying Connection Pool Options

The Greenplum-Spark Connector provides connection pool configuration options. These options are named with the pool. prefix:

pool.maxSize
    The maximum number of connections in the connection pool. Optional; the default value is 64.

pool.timeoutMs
    The amount of time, in milliseconds, after which an inactive connection is considered idle. Optional; the default value is 10,000 (10 seconds).

pool.minIdle
    The minimum number of idle connections maintained in the connection pool. Optional; the default value is 0.

To set each connection pool option as a single option:

.option("pool.maxSize", "50")
.option("pool.minIdle", "5")
.option("pool.timeoutMs", "7500")

The first DataFrame that you create with a specific connection string URL, username, and password combination defines the configuration of that connection pool. The Greenplum-Spark Connector ignores connection pool options specified on subsequent DataFrames created with the same URL/username/password combination.
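
To illustrate this behavior, a sketch (assuming the "greenplum" data source short name and the gscReadOptionMap defined earlier):

// The first DataFrame created with this URL/user/password combination
// defines the pool configuration: maxSize is 50.
val df1 = spark.read.format("greenplum")
      .options(gscReadOptionMap)
      .option("pool.maxSize", "50")
      .load()

// Same URL/user/password combination: the Connector reuses the existing
// pool and ignores this pool.maxSize setting.
val df2 = spark.read.format("greenplum")
      .options(gscReadOptionMap)
      .option("pool.maxSize", "10")
      .load()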

Refer to JDBC Connection Pooling for additional information about connection pool configuration options.