Pivotal Greenplum®-Spark® Connector v1.1

Greenplum Database Configuration and Maintenance

You must configure Greenplum Database client host access and role privileges and attributes before using the Greenplum-Spark Connector to transfer data between your Greenplum Database and Spark clusters.

Once you start running Spark applications that use the Greenplum-Spark Connector, you may be required to perform certain Greenplum Database maintenance tasks.

These Greenplum Database configuration and maintenance tasks, described below, must be performed by a Greenplum user with administrative (SUPERUSER) privileges.

Configuring Greenplum Database

Client Host Access

You must explicitly configure Greenplum Database to permit access from all Spark nodes and stand-alone clients. Configure access for each Spark node, Greenplum Database, and Greenplum Database role combination in the pg_hba.conf file on the Greenplum Database master node.

Refer to Configuring Client Authentication in the Greenplum Database documentation for detailed information on configuring pg_hba.conf.
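As an illustration, a pg_hba.conf entry granting a single Spark node access might look like the following; the host address, database name, and role name shown here are hypothetical:

```
# Permit the role "sparkuser" to connect to the database "tutorial"
# from the Spark node at 192.0.2.10, authenticating with an md5 password.
# TYPE  DATABASE   USER       ADDRESS         METHOD
host    tutorial   sparkuser  192.0.2.10/32   md5
```

Add one such entry for each Spark node, database, and role combination, then reload the Greenplum Database configuration for the changes to take effect.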

Role Privileges

The Greenplum-Spark Connector uses JDBC to communicate with the Greenplum Database master node. The Greenplum user/role that you provide when you use the Greenplum-Spark Connector to transfer data between Greenplum Database and Spark must be assigned Greenplum Database SUPERUSER administrative privileges.

See the Greenplum Database Managing Roles and Privileges documentation for further information on assigning privileges to Greenplum Database users.
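For example, one way to create such a role is sketched below; the role name and password are illustrative:

```sql
-- Create a login role for Greenplum-Spark data transfer.
-- (Role name and password here are examples only.)
CREATE ROLE sparkuser LOGIN PASSWORD 'changeme';

-- Assign the SUPERUSER attribute that the Connector requires.
ALTER ROLE sparkuser SUPERUSER;
```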

Role Search Path

The Greenplum Database table that you specify when you use the Greenplum-Spark Connector to load Greenplum Database data into Spark must be accessible from the default schema search_path defined for the role.

Additionally, the schema named public must be the first schema named in the role’s search_path.

Refer to the Greenplum Database ALTER ROLE documentation for information on setting the default search_path for a role.
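For example, you can inspect and set a role's search_path as follows; the role and schema names are illustrative:

```sql
-- Display the search path in effect for the current session.
SHOW search_path;

-- Set the role's default search_path, listing the public schema first
-- as the Connector requires. (Role and schema names are examples only.)
ALTER ROLE sparkuser SET search_path TO public, myschema;
```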

Greenplum Database Maintenance Tasks

The Greenplum-Spark Connector uses Greenplum Database external tables to load Greenplum data into Spark. Maintenance tasks related to these external tables may include:

  • Periodically checking the status of your Greenplum Database catalogs for bloat, and VACUUM-ing the catalog as appropriate. Refer to the Greenplum Database System Catalog Maintenance and VACUUM documentation for further information.
  • Manually removing Greenplum-Spark Connector-created external tables when your Spark cluster shuts down abnormally. Refer to Cleaning Up Orphaned Greenplum External Tables for details related to this procedure.
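As a starting point for these tasks, the queries below sketch one way to locate external tables and vacuum a catalog table; in Greenplum Database, external tables are marked with relstorage = 'x' in the pg_class catalog:

```sql
-- List external tables in the current database; orphaned
-- Connector-created tables can then be dropped manually.
SELECT n.nspname AS schema_name, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relstorage = 'x';

-- Reclaim catalog space after dropping external tables; pg_class is
-- one of several catalog tables that may accumulate bloat.
VACUUM pg_catalog.pg_class;
```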