Writing from Spark into Greenplum Database

Writing a Spark DataFrame into a Greenplum Database table loads each Row in the DataFrame into the table. You can use the Spark Scala API or the spark-shell interactive shell to write Spark data to a Greenplum Database table that you created with the CREATE TABLE SQL command.

The Greenplum-Spark Connector provides a Spark data source optimized for writing Spark data into Greenplum Database tables. To write to a Greenplum Database table, you must identify the Greenplum-Spark Connector data source name and provide write options for the export.

Greenplum-Spark Connector Data Source

A Spark data source provides an access point to structured data. Spark provides several pre-defined data sources to support specific file types and databases. You specify a Spark data source using either its fully qualified name or its short name.

The Greenplum-Spark Connector exposes a Spark data source named greenplum to transfer data between Spark and Greenplum Database. The Greenplum-Spark Connector supports specifying the data source only with this short name.

Use the .format(datasource: String) Scala method to identify the data source. You must provide the Greenplum-Spark Connector data source short name greenplum to the .format() method. For example:

dfToWrite.write.format("greenplum")

Connector Write Options

The greenplum data source supports the write options identified in the list below. An option is required unless otherwise specified.

  • url: The JDBC connection string URL; see Constructing the Greenplum Database JDBC URL.
  • dbschema: The name of the Greenplum Database schema in which dbtable resides. This option also identifies the schema in which the Greenplum-Spark Connector creates temporary external tables. Optional; the default is the schema named public.
  • dbtable: The name of the Greenplum Database table. The Greenplum-Spark Connector creates this table in the schema named dbschema if it does not exist.
  • driver: The fully qualified class path of the custom JDBC driver. Optional; specify only when using a custom JDBC driver.
  • user: The Greenplum Database user/role name.
  • password: The Greenplum Database password for the user. Optional; you can omit the password if Greenplum Database is configured to not require a password for the specified user, or if you use Kerberos authentication and provide the required authentication properties in the JDBC connection string URL.
  • truncate: The table overwrite mode. Governs the actions of the Connector when you specify SaveMode.Overwrite and the target Greenplum Database table exists. Optional; the default value is false, in which case the Connector drops and then re-creates the target table before it writes any data. When true, the Connector truncates the target table before writing data.
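
For example, a sketch of an option map that supplies the optional dbschema, driver, and truncate keys alongside the required options (the custom driver class name shown is hypothetical):

val gscOptions = Map(
      "url" -> "jdbc:postgresql://gpdb-master:5432/testdb",
      "user" -> "bill",
      "password" -> "changeme",
      "dbtable" -> "table2",
      // Optional keys:
      "dbschema" -> "myschema",
      "driver" -> "com.example.CustomJdbcDriver",
      "truncate" -> "true"
)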

Writing to Greenplum Database

When you write a Spark DataFrame to a Greenplum Database table, you identify the Greenplum-Spark Connector data source, provide the write options, and invoke the DataFrameWriter.save() method on the Spark DataFrame that you want to write. For example:

val gscWriteOptionMap = Map(
      "url" -> "jdbc:postgresql://gpdb-master:5432/testdb",
      "user" -> "bill",
      "password" -> "changeme",
      "dbschema" -> "myschema",
      "dbtable" -> "table2"
)

dfToWrite.write.format("greenplum")
      .options(gscWriteOptionMap)
      .save()
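
This example assumes that the DataFrame named dfToWrite already exists. A minimal sketch of creating such a DataFrame in the spark-shell, where the SparkSession is predefined as spark (the column names and data are illustrative):

// Build a small two-column DataFrame from an in-memory sequence.
val dfToWrite = spark.createDataFrame(
      Seq((1, "JAN"), (2, "FEB"))
).toDF("id", "flt_month")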

You can specify the behaviour of the Connector when the Greenplum Database table already exists, or when the table already contains data. You provide this information with the DataFrameWriter.mode(saveMode: SaveMode) method.

The Greenplum-Spark Connector supports the following Spark SaveMode settings:

  • ErrorIfExists: The Connector returns an error if the Greenplum Database table already exists. This is the default value.
  • Append: The Connector appends the Spark data to data that may already exist in the Greenplum Database table.
  • Ignore: If the Greenplum Database table already exists, the Connector ignores the write request; it neither writes data to the table nor does it disturb the existing data.
  • Overwrite: If the Greenplum Database table already exists, the truncate option value governs whether the Connector drops and re-creates the target table (the default, truncate is false) or truncates the target table (truncate is true) before writing the data. Note: The Greenplum-Spark Connector cannot truncate a Greenplum Database table if it includes any foreign key constraints.

You must import org.apache.spark.sql.SaveMode in your Spark application or spark-shell session before you specify a SaveMode. For example, to specify the Append mode on write:

import org.apache.spark.sql.SaveMode

dfToWrite.write.format("greenplum")
      .options(gscWriteOptionMap)
      .mode(SaveMode.Append)
      .save()
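
Similarly, to overwrite existing data by truncating the target table rather than dropping and re-creating it, you can combine SaveMode.Overwrite with the truncate option. A sketch that reuses the option map defined earlier:

dfToWrite.write.format("greenplum")
      .options(gscWriteOptionMap + ("truncate" -> "true"))
      .mode(SaveMode.Overwrite)
      .save()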

When you call save() on a Spark DataFrame to write to a Greenplum Database table, the Greenplum-Spark Connector either saves the entire DataFrame to the table, or aborts the write operation.

Warning: If the Greenplum-Spark Connector encounters an error during .save(), the target Greenplum table may be left in an inconsistent state.
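
Because a failed save() can leave the target table in an inconsistent state, you may want to detect the failure and recover in your application. A minimal sketch using scala.util.Try (the recovery action shown is illustrative):

import scala.util.{Failure, Success, Try}

Try {
  dfToWrite.write.format("greenplum")
        .options(gscWriteOptionMap)
        .mode(SaveMode.Append)
        .save()
} match {
  case Success(_) => println("Write completed.")
  case Failure(e) =>
    // The target table may be inconsistent; inspect it and re-run or
    // clean up as appropriate for your application.
    println(s"Write failed: ${e.getMessage}")
}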

Table Creation Modes

When you write Spark data to a Greenplum Database table, the Greenplum-Spark Connector creates the table for you if it does not already exist. Alternatively, you can choose to pre-create the target table.

When the Greenplum-Spark Connector creates the Greenplum Database table for you, the Connector:

  • Creates the Greenplum Database table without specifying a distribution column.
  • Creates the Greenplum table using the column names defined in the Spark DataFrame (for example, "flt_month"). Note that the Spark column name reflects any transformations you may have performed on the Spark data. For example, if you used the avg() method to average the data in a column named depdelayminutes, the Spark column name is literally avg(depdelayminutes); see the aliasing sketch after this list.
  • Specifies the column names in double-quotes. This renders the column names case-sensitive to Greenplum Database.
  • Creates the Greenplum table column with the NOT NULL clause when the Spark DataFrame column is not nullable.
  • Creates the table columns in the order that they are defined in the Spark DataFrame.
  • Maps the data type of each Spark column to the equivalent Greenplum data type.
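
Because the Connector uses the literal Spark column names when it creates the table, you may want to alias transformed columns before writing so that the created Greenplum column receives a simpler name. A minimal sketch (the input DataFrame df and its columns are illustrative):

import org.apache.spark.sql.functions.avg

// Without the alias, the created column would be literally named
// "avg(depdelayminutes)"; with it, the column is named "avg_depdelayminutes".
val dfAvg = df.groupBy("flt_month")
      .agg(avg("depdelayminutes").alias("avg_depdelayminutes"))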

If you choose to pre-create the target Greenplum Database table, take the following into consideration:

  • Use Greenplum table column names that match the column names defined in the Spark DataFrame that you write. Keep in mind that the Greenplum-Spark Connector treats column names as case-sensitive.
  • Select a data type for each Greenplum table column that is equivalent to the data type of the related Spark DataFrame column. Refer to Spark to Greenplum Data Type Mapping for data type mapping information.
  • If a Spark data column contains NULL values, ensure that you do not specify the NOT NULL clause when you create the associated Greenplum Database table column.
  • You may create the Greenplum Database table with a subset of the columns in the Spark DataFrame, or with a different column ordering. The Greenplum table must not include columns that are not present in the Spark DataFrame.
  • Ensure that you assign the Greenplum Database user/role the permissions required to write to the table. The user may also require permission to (re-)create the table. Refer to Role Privileges for information about configuring Greenplum Database privileges.
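
If you pre-create the table over JDBC from Scala, a minimal sketch might look like the following (the connection details, table name, and column definitions are illustrative; quoting the column names keeps them case-sensitive, matching the Connector's treatment of DataFrame column names):

import java.sql.DriverManager

// Placeholder connection details; substitute values for your environment.
val conn = DriverManager.getConnection(
      "jdbc:postgresql://gpdb-master:5432/testdb", "bill", "changeme")
try {
  val stmt = conn.createStatement()
  // "delay" is nullable in the DataFrame, so its column omits NOT NULL.
  stmt.executeUpdate(
    """CREATE TABLE myschema.table2 ("flt_month" text NOT NULL, "delay" integer)""")
  stmt.close()
} finally {
  conn.close()
}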