Writing from Spark into Greenplum Database
Writing a Spark DataFrame into a Greenplum Database table loads each Row in the DataFrame into the table. You can use the Spark Scala API or the spark-shell interactive shell to write Spark data to a Greenplum Database table that you created with the CREATE TABLE SQL command.
The Greenplum-Spark Connector provides a Spark data source optimized for writing Spark data into Greenplum Database. To write to a Greenplum Database table, you must identify the Greenplum-Spark Connector data source name and provide write options for the export.
Greenplum-Spark Connector Data Source
A Spark data source provides an access point to structured data. Spark provides several pre-defined data sources to support specific file types and databases. You specify a Spark data source using either its fully qualified name or its short name.
The Greenplum-Spark Connector exposes a Spark data source named greenplum to transfer data between Spark and Greenplum Database. The Greenplum-Spark Connector supports specifying the data source only with this short name.
Use the .format(datasource: String) Scala method to identify the data source. You must provide the Greenplum-Spark Connector data source short name greenplum to the .format() method. For example:
dfToWrite.write.format("greenplum")
Connector Write Options
The greenplum data source supports the write options identified in the table below. An option is required unless otherwise specified.
| Option Key | Value Description |
|---|---|
| url | The JDBC connection string URL; see Constructing the Greenplum Database JDBC URL. |
| dbschema | The name of the Greenplum Database schema in which dbtable resides. This option also identifies the name of the schema in which the Greenplum-Spark Connector creates temporary external tables. Optional; the default schema is the schema named public. |
| dbtable | The name of the Greenplum Database table. The Greenplum-Spark Connector creates this table in the schema named dbschema if it does not exist. |
| driver | The fully qualified class name of the custom JDBC driver. Optional; specify only when using a custom JDBC driver. |
| user | The Greenplum Database user/role name. |
| password | The Greenplum Database password for the user. Optional; you can omit the password if Greenplum Database is configured to not require a password for the specified user, or if you use Kerberos authentication and provide the required authentication properties in the JDBC connection string URL. |
| truncate | The table overwrite mode. Governs the actions of the Connector when you specify SaveMode.Overwrite and the target Greenplum Database table exists. Optional; the default value is false, in which case the Connector drops and then re-creates the target table before it writes any data. When true, the Connector truncates the target table before writing data. |
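For reference, the following sketch shows an options map that also sets the optional driver and truncate keys; the URL, credentials, driver class, and schema/table names are placeholder values:

// All values are placeholders; driver is needed only with a custom JDBC
// driver, and truncate is consulted only when you specify SaveMode.Overwrite.
val fullWriteOptionMap = Map(
  "url"      -> "jdbc:postgresql://gpdb-master:5432/testdb",
  "user"     -> "bill",
  "password" -> "changeme",
  "dbschema" -> "myschema",              // optional; defaults to public
  "dbtable"  -> "table2",
  "driver"   -> "org.postgresql.Driver", // optional; custom JDBC driver class
  "truncate" -> "true"                   // optional; applies to SaveMode.Overwrite only
)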
Writing to Greenplum Database
When you write a Spark DataFrame to a Greenplum Database table, you identify the Greenplum-Spark Connector data source, provide the write options, and invoke the DataFrameWriter.save() method on the Spark DataFrame that you want to write. For example:
val gscWriteOptionMap = Map(
  "url" -> "jdbc:postgresql://gpdb-master:5432/testdb",
  "user" -> "bill",
  "password" -> "changeme",
  "dbschema" -> "myschema",
  "dbtable" -> "table2"
)

dfToWrite.write.format("greenplum")
  .options(gscWriteOptionMap)
  .save()
You can specify the behaviour of the Connector when the Greenplum Database table already exists, or when the table already contains data. You provide this information with the DataFrameWriter.mode(saveMode: SaveMode) method. The Greenplum-Spark Connector supports the following Spark SaveMode settings:
| SaveMode | Behaviour |
|---|---|
| ErrorIfExists | The Connector returns an error if the Greenplum Database table already exists. This is the default value. |
| Append | The Connector appends the Spark data to data that may already exist in the Greenplum Database table. |
| Ignore | If the Greenplum Database table already exists, the Connector ignores the write request; it neither writes data to the table nor does it disturb the existing data. |
| Overwrite | If the Greenplum Database table already exists, the truncate option value governs whether the Connector drops and re-creates the target table (the default, truncate is false) or truncates the target table (truncate is true) before writing the data. Note: The Greenplum-Spark Connector cannot truncate a Greenplum Database table if it includes any foreign key constraints. |
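For example, a minimal sketch of an overwrite that truncates the existing table rather than dropping and re-creating it, reusing the gscWriteOptionMap defined above:

import org.apache.spark.sql.SaveMode

// With truncate set to true, SaveMode.Overwrite empties the existing
// table and preserves its definition instead of dropping it.
dfToWrite.write.format("greenplum")
  .options(gscWriteOptionMap + ("truncate" -> "true"))
  .mode(SaveMode.Overwrite)
  .save()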
You must enter import org.apache.spark.sql.SaveMode in your Spark Scala application or spark-shell session before you can reference the SaveMode settings. For example, to specify the Append mode on a write in a Scala application:
import org.apache.spark.sql.SaveMode

dfToWrite.write.format("greenplum")
  .options(gscWriteOptionMap)
  .mode(SaveMode.Append)
  .save()
To specify the Append mode on a write in a PySpark application, provide the mode as a text string:
.mode("Append")
When you call save() on a Spark DataFrame to write to a Greenplum Database table, the Greenplum-Spark Connector either saves the entire DataFrame to the table, or aborts the write operation.

Warning: If the Greenplum-Spark Connector encounters an error during .save(), the target Greenplum table may be left in an inconsistent state.
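If your application needs to detect a failed write, one approach is to wrap the save() call; a minimal sketch, reusing the dfToWrite and gscWriteOptionMap values from the earlier examples:

import org.apache.spark.sql.SaveMode
import scala.util.{Failure, Success, Try}

// save() throws on failure; per the warning above, a failed write may
// leave the target table in a state that requires manual cleanup.
Try {
  dfToWrite.write.format("greenplum")
    .options(gscWriteOptionMap)
    .mode(SaveMode.Append)
    .save()
} match {
  case Success(_) => println("write completed")
  case Failure(e) => println(s"write failed: ${e.getMessage}")
}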
Table Creation Modes
When you write Spark data to a Greenplum Database table, the Greenplum-Spark Connector creates the table for you if it does not already exist. Alternatively, you can choose to pre-create the target table.
When the Greenplum-Spark Connector creates the Greenplum Database table for you, the Connector:
- Creates the Greenplum Database table without specifying a distribution column.
- Creates the Greenplum table using the column names defined in the Spark DataFrame (for example, "flt_month"). Note that the Spark column name reflects any transformations you may have performed on the Spark data. For example, if you used the avg() method to average the data in a column named depdelayminutes, the Spark column name is literally avg(depdelayminutes). You can use the .as() method on the aggregate expression to provide a different name for the column, for example .agg(avg("depdelayminutes").as("depdelaymins")); see the sketch after this list.
- Specifies the column names in double-quotes. This renders the column names case-sensitive to Greenplum Database.
- Creates the Greenplum table column with the NOT NULL clause when the Spark DataFrame column is not nullable.
- Creates the table columns in the order in which they are defined in the Spark DataFrame.
- Maps the data type of each Spark column to the equivalent Greenplum data type.
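The following sketch illustrates the column-naming behaviour; the flightDf DataFrame and its columns are hypothetical, and renaming the aggregate with .as() prevents the Connector from creating a Greenplum column literally named avg(depdelayminutes):

import org.apache.spark.sql.functions.avg

// Hypothetical input: flightDf holds flight records with flt_month and
// depdelayminutes columns. Without .as(), the aggregate column, and hence
// the Greenplum column the Connector creates, is named "avg(depdelayminutes)".
val dfAvgDelay = flightDf.groupBy("flt_month")
  .agg(avg("depdelayminutes").as("depdelaymins"))

dfAvgDelay.write.format("greenplum")
  .options(gscWriteOptionMap)
  .save()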
If you choose to pre-create the target Greenplum Database table, take the following into consideration:
- Use Greenplum table column names that match the column names defined in the Spark DataFrame that you write. Keep in mind that the Greenplum-Spark Connector treats column names as case-sensitive.
- Select a data type for each Greenplum table column that is equivalent to the data type of the related Spark DataFrame column. Refer to Spark to Greenplum Data Type Mapping for data type mapping information.
- If a Spark data column contains NULL values, ensure that you do not specify the NOT NULL clause when you create the associated Greenplum Database table column.
- You may create the Greenplum Database table with a subset of the columns in the Spark DataFrame, or with a different column ordering. The Greenplum table must not include columns that are not present in the Spark DataFrame. A write to such a pre-created table is sketched after this list.
- Ensure that you assign the Greenplum Database user/role the permissions required to write to the table. The user may also require permission to (re-)create the table. Refer to Role Privileges for information about configuring Greenplum Database privileges.
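As referenced above, a minimal sketch of a write to a pre-created table, assuming a hypothetical table that defines only the flt_month and depdelaymins columns from the earlier sketch; projecting the DataFrame down to those columns keeps the write aligned with the table, and Append avoids the default ErrorIfExists failure on an existing table:

import org.apache.spark.sql.SaveMode

// The target table already exists, so the default ErrorIfExists mode
// would return an error; Append adds the rows to the existing table.
dfAvgDelay.select("flt_month", "depdelaymins")
  .write.format("greenplum")
  .options(gscWriteOptionMap)
  .mode(SaveMode.Append)
  .save()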