About Connector Options
You provide Greenplum Database connection options and the read/write options required by the Greenplum-Spark Connector via generic key-value String pairs.
The Greenplum-Spark Connector supports the options identified in the table below. An option is required unless otherwise specified.
Option Key | Value Description | Operation |
---|---|---|
url | The JDBC connection string URL; see Constructing the Greenplum Database JDBC URL. | Read, Write |
dbschema | The name of the Greenplum Database schema in which dbtable resides. This option also identifies the name of the schema in which the Greenplum-Spark Connector creates temporary external tables. Optional; the default schema is public. | Read, Write |
dbtable | The name of the Greenplum Database table. When reading from Greenplum Database, this table must reside in the Greenplum Database schema identified in the dbschema option value. When writing from Spark to Greenplum Database, the Greenplum-Spark Connector creates the table in dbschema if the table does not already exist. | Read, Write |
driver | The fully qualified class path of the custom JDBC driver. Optional; specify only when using a custom JDBC driver. | Read, Write |
user | The Greenplum Database user/role name. | Read, Write |
password | The Greenplum Database password for the user. Optional; you can omit the password if Greenplum Database is configured to not require a password for the specified user, or if you use Kerberos authentication and provide the required authentication properties in the JDBC connection string URL. | Read, Write |
partitionColumn | The name of the Greenplum Database table column to use for Spark partitioning. This column must have one of the integer, bigint, serial, or bigserial Greenplum Database data types. The column name must be lowercase. | Read |
partitionsPerSegment | The number of Spark partitions per Greenplum Database segment. Optional; the default value is 1 partition. | Read |
connector.port | The port number, or the name of an environment variable that identifies the port number, that the Greenplum-Spark Connector should use for data transfer to/from Spark worker nodes. Optional; by default, the Connector uses a random port. | Read, Write |
pool.maxSize | The maximum number of connections in the connection pool. Optional; the default value is 64. | Read, Write |
pool.timeoutMs | The amount of time, in milliseconds, after which an inactive connection is considered idle. Optional; the default value is 10,000 (10 seconds). | Read, Write |
pool.minIdle | The minimum number of idle connections maintained in the connection pool. Optional; the default value is 0. | Read, Write |
You can specify options to the Connector individually or in an options Map. The class methods of interest for setting Greenplum-Spark Connector options are .option(key: String, value: String) for specifying an individual option, and .options(options: Map[String, String]) for specifying an options map.
To specify an option individually, provide <option_key> and <value> strings to the DataFrameReader.option() method. For example, to specify the user option:
.option("user", "gpdb_role_name")
To construct a scala.collection.Map comprising more than one option, you provide the <option_key> and <value> strings for each option. For example, to create the options for a read operation:
val gscReadOptionMap = Map(
"url" -> "jdbc:postgresql://gpdb-master:5432/testdb",
"user" -> "bill",
"password" -> "changeme",
"dbschema" -> "myschema",
"dbtable" -> "table1",
"partitionColumn" -> "id"
)
To provide an options map to the Connector, specify it in the DataFrameReader.options() or DataFrameWriter.options() method. For example, to provide the gscReadOptionMap map created above:
.options(gscReadOptionMap)
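For example, a complete read using this map might look like the following sketch; as above, the data source name and the spark-shell SparkSession are assumptions for illustration:
// A sketch: supplying all options at once via an options Map.
val gpdf = spark.read
  .format("io.pivotal.greenplum.spark.GreenplumRelationProvider") // assumed data source name
  .options(gscReadOptionMap)
  .load()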
Specifying Partition Options
Partition options apply only when you read data from Greenplum Database into Spark.
A Spark DataFrame is a fault-tolerant collection of elements partitioned across the Spark cluster nodes. Spark operates on these elements in parallel.
Greenplum Database distributes its table data across segments running on segment hosts.
The Greenplum-Spark Connector provides two options to configure the mapping between Spark partitions and Greenplum Database segment data: partitionColumn and partitionsPerSegment.
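You can observe this mapping directly with the standard Spark API. In this brief sketch, gpdf is assumed to be a DataFrame returned by a Connector read such as the earlier examples:
// With the default settings, expect one Spark partition per
// Greenplum Database primary segment.
println(gpdf.rdd.getNumPartitions)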
partitionColumn
The partitionColumn option that you specify must have the integer, bigint, serial, or bigserial Greenplum Database data type. The partitionColumn you identify need not be the column specified with the DISTRIBUTED BY (<column>) clause when you created the Greenplum Database table.
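For example, this sketch reads a hypothetical table that is distributed on one column but partitioned for Spark on another; the table, its columns, and the data source name are all assumptions for illustration:
// Hypothetical table, distributed on customer_id:
//   CREATE TABLE myschema.orders (order_id bigint, customer_id int, amount numeric)
//   DISTRIBUTED BY (customer_id);
// The read below partitions on order_id instead; the partition column need
// not match the distribution key, it only needs an eligible data type.
val ordersDf = spark.read
  .format("io.pivotal.greenplum.spark.GreenplumRelationProvider") // assumed data source name
  .options(gscReadOptionMap ++ Map("dbtable" -> "orders", "partitionColumn" -> "order_id"))
  .load()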
partitionsPerSegment
By default, the Greenplum-Spark Connector creates one Spark partition per Greenplum Database segment. You can set the partitionsPerSegment option to specify a larger number of Spark partitions.
Spark partitions have a 2 GB size limit. If you are using the Connector to move more than 2 GB of data per Greenplum Database segment, you must increase the partitionsPerSegment option value appropriately.
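As a sketch of the sizing arithmetic, suppose each segment holds roughly 7 GB of the table's data (an assumed figure for illustration). One partition per segment would exceed the 2 GB limit, so you would need at least ceil(7 / 2) = 4 partitions per segment:
// Assumed: ~7 GB of table data per segment; 7 / 2 rounds up to 4.
.option("partitionsPerSegment", "4")
// With, say, 8 primary segments this yields 8 * 4 = 32 Spark partitions,
// which you can confirm with gpdf.rdd.getNumPartitions.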
Specifying the connector.port Option
Use the connector.port option to specify the port number that the Greenplum-Spark Connector uses for data transfer between Greenplum Database and Spark worker nodes.
Option Key | Value Description |
---|---|
connector.port | The port number, or the name of an environment variable that identifies the port number, that the Greenplum-Spark Connector should use for data transfer to/from Spark worker nodes. Optional; by default, the Connector uses a random port. |
You can specify either a port number or an environment variable name as the connector.port option value. For example, to set a specific port number as a single option:
.option("connector.port", "12900")
Or, to set it within a Map of options:
"connector.port" -> "12900"
If you choose to specify the port number with an environment variable, prefix the environment variable name with env. For example, to identify the environment variable named GSC_EXTERNAL_PORT as the connector.port option value:
"connector.port" -> "env.GSC_EXTERNAL_PORT",
Note: Setting connector.port to env.GPFDIST_PORT results in the same behavior as that of Greenplum-Spark Connector version 1.2.0.
Refer to Configuring Spark Worker Ports for additional information about the connector.port option.
Specifying Connection Pool Options
The Greenplum-Spark Connector provides connection pool configuration options. These options are named with the pool. prefix:
Option Key | Value Description |
---|---|
pool.maxSize | The maximum number of connections in the connection pool. Optional, the default value is 64. |
pool.timeoutMs | The amount of time, in milliseconds, after which an inactive connection is considered idle. Optional, the default value is 10,000 (10 seconds). |
pool.minIdle | The minimum number of idle connections maintained in the connection pool. Optional, the default value is 0. |
To set each connection pool option as a single option:
.option("pool.maxSize", "50")
.option("pool.minIdle", "5")
.option("pool.timeoutMs", "7500")
The first DataFrame that you create with a specific connection string URL, username, and password combination defines the configuration of that connection pool. The Greenplum-Spark Connector ignores connection pool options specified on subsequent DataFrames created with the same URL/username/password combination.
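A brief sketch of this behavior, reusing the assumed data source name and option map from the earlier examples; both reads share a URL/username/password combination, so the pool is configured by the first read:
// First DataFrame with this URL/user/password: its pool options take effect.
val df1 = spark.read
  .format("io.pivotal.greenplum.spark.GreenplumRelationProvider") // assumed data source name
  .options(gscReadOptionMap ++ Map("pool.maxSize" -> "50"))
  .load()
// Same URL/user/password: this pool.maxSize is ignored and the existing
// pool (maxSize 50) is reused.
val df2 = spark.read
  .format("io.pivotal.greenplum.spark.GreenplumRelationProvider") // assumed data source name
  .options(gscReadOptionMap ++ Map("dbtable" -> "table2", "pool.maxSize" -> "10"))
  .load()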
Refer to JDBC Connection Pooling for additional information about connection pool configuration options.