Greenplum Database Configuration and Maintenance

You must configure Greenplum Database client host access, role privileges, and role attributes before using the Greenplum-Spark Connector to transfer data between your Greenplum Database and Spark clusters.

Once you start running Spark applications that use the Greenplum-Spark Connector, you may be required to perform certain Greenplum Database maintenance tasks.

These Greenplum Database configuration and maintenance tasks, described below, must be performed by a Greenplum user with administrative (SUPERUSER) privileges.

Configuring Greenplum Database

Client Host Access

You must explicitly configure Greenplum Database to permit access from all Spark nodes and stand-alone clients. Configure access for each combination of Spark node, Greenplum database, and Greenplum Database role in the pg_hba.conf file on the Greenplum Database master node.

Refer to Configuring Client Authentication in the Greenplum Database documentation for detailed information on configuring pg_hba.conf.
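
For example, to allow a hypothetical role named sparkuser to connect to a database named tutorial from hosts in a hypothetical Spark cluster subnet 192.168.0.0/24 with md5 password authentication, you might add an entry similar to the following to pg_hba.conf (the role, database, and subnet shown are illustrative placeholders):

    # TYPE  DATABASE   USER       ADDRESS           METHOD
    host    tutorial   sparkuser  192.168.0.0/24    md5

After editing pg_hba.conf, reload the Greenplum Database server configuration (for example, with gpstop -u) for the change to take effect.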

Role Privileges

The Greenplum-Spark Connector uses JDBC to communicate with the Greenplum Database master node. The Greenplum user/role name that you provide when you use the Greenplum-Spark Connector to transfer data between Greenplum Database and Spark must have certain privileges assigned by the administrator:

  • The user/role must have USAGE and CREATE privileges on each non-public database schema in which a table to be transferred resides:

    <db-name>=# GRANT USAGE, CREATE ON SCHEMA <schema_name> TO <user_name>;
    
  • The user/role must have the SELECT privilege on every Greenplum Database table that the user will read into Spark:

    <db-name>=# GRANT SELECT ON <schema_name>.<table_name> TO <user_name>;
    
  • The user/role must have permission to create writable external tables using the Greenplum Database gpfdist protocol:

    <db-name>=# ALTER USER <user_name> CREATEEXTTABLE(type = 'writable', protocol = 'gpfdist');
    

See the Greenplum Database Managing Roles and Privileges documentation for further information on assigning privileges to Greenplum Database users.

Greenplum Database Maintenance Tasks

The Greenplum-Spark Connector uses Greenplum Database external tables to load Greenplum data into Spark. Maintenance tasks related to these external tables may include:

  • Periodically checking the status of your Greenplum Database system catalogs for bloat and running VACUUM on the catalog as appropriate (see the first sketch after this list). Refer to the Greenplum Database System Catalog Maintenance and VACUUM documentation for further information.
  • Manually removing Greenplum-Spark Connector-created external tables when your Spark cluster shuts down abnormally (see the second sketch after this list). Refer to Cleaning Up Orphaned Greenplum External Tables for details related to this procedure.
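
As a sketch of the first task, you might periodically run VACUUM against heavily-used system catalog tables in each database that Spark applications access. The catalog tables named below are illustrative, not an exhaustive list:

    <db-name>=# VACUUM pg_catalog.pg_class;
    <db-name>=# VACUUM pg_catalog.pg_attribute;
    <db-name>=# VACUUM pg_catalog.pg_exttable;

As a sketch of the second task, you can list the external tables in a database by joining the pg_class, pg_namespace, and pg_exttable system catalogs, identify connector-created tables by name, and drop each orphaned table with DROP EXTERNAL TABLE. The query below is illustrative; confirm that a table is truly orphaned before you drop it:

    <db-name>=# SELECT n.nspname AS schema_name, c.relname AS table_name
                  FROM pg_class c
                       JOIN pg_namespace n ON (c.relnamespace = n.oid)
                       JOIN pg_exttable x ON (c.oid = x.reloid);
    <db-name>=# DROP EXTERNAL TABLE <schema_name>.<table_name>;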