Greenplum Database Configuration and Maintenance
You must configure Greenplum Database client host access and role privileges and attributes before using the Greenplum-Spark Connector to transfer data between your Greenplum Database and Spark clusters.
Once you start running Spark applications that use the Greenplum-Spark Connector, you may be required to perform certain Greenplum Database maintenance tasks.
These Greenplum Database configuration and maintenance tasks, described below, must be performed by a Greenplum user with administrative (
SUPERUSER) privileges unless otherwise noted.
You must explicitly configure Greenplum Database to permit access from all Spark nodes and stand-alone clients. Configure access for each Spark node, Greenplum database, and Greenplum Database role combination in the
pg_hba.conf file on the master node.
Refer to Configuring Client Authentication in the Greenplum Database documentation for detailed information on configuring
The Greenplum-Spark Connector uses JDBC to communicate with the Greenplum Database master node. The Greenplum user/role name that you provide when you use the Greenplum-Spark Connector to transfer data between Greenplum Database and Spark must have certain privileges assigned by the administrator:
The user/role must have
CREATEprivileges on each non-public database schema in which tables the user will read or write reside:
<db-name>=# GRANT USAGE, CREATE ON SCHEMA <schema_name> TO <user_name>;
The user/role must have the
SELECTprivilege on every Greenplum Database table that the user will read into Spark:
<db-name>=# GRANT SELECT ON <schema_name>.<table_name> TO <user_name>;
To read a Greeplum Database table into Spark, the user/role must have permission to create writable external tables using the Greenplum Database
<db-name>=# ALTER USER <user_name> CREATEEXTTABLE(type = 'writable', protocol = 'gpfdist');
If the user/role writing to Greenplum Database is not a database or table owner, the role must have
INSERTprivileges on each existing Greenplum Database table to which the user will write Spark data:
<db-name>=# GRANT SELECT, INSERT ON <schema_name>.<table_name> TO <user_name>;
To write Spark data into a Greeplum Database table, the user/role must have permission to create readable external tables using the Greenplum Database
<db-name>=# ALTER USER <user_name> CREATEEXTTABLE(type = 'readable', protocol = 'gpfdist');
See the Greenplum Database Managing Roles and Privileges documention for further information on assigning privileges to Greenplum Database users.
The Greenplum-Spark Connector uses Greenplum Database external tables to load Greenplum data into Spark. Maintenance tasks related to these external tables may include:
- Periodically checking the status of your Greenplum Database catalogs for bloat, and
VACUUM-ing the catalog as appropriate. Refer to the Greenplum Database System Catalog Maintenance and
VACUUMdocumentation for further information.
ANALYZE-ing the Greenplum Database tables that applications using the Greenplum-Spark Connector load into Spark. Refer to the Greenplum Database Updating Statistics with ANALYZE and
ANALYZEdocumentation for further information.
- Manually removing Greenplum-Spark Connector-created external tables when your Spark cluster shuts down abnormally. Refer to Cleaning Up Orphaned Greenplum External Tables for details related to this procedure.
In addition to users with Greenplum Database
SUPERUSER privileges, database or table owners may perform the maintenance tasks identified above.
Note: This procedure is not required under normal operation. Perform this procedure only when your Spark cluster exits abnormally.
When it loads a Greenplum Database table into Spark, the Greenplum-Spark Connector creates external tables in the database schema you specify in the
dbschema option value, or in the
public schema if you do not provide a
dbschema option. If your Spark cluster unexpectedly shuts down during a Greenplum-Spark job, you may be required to manually clean up these Greenplum Database external tables.
Perform the following procedure to locate and delete orphaned Greenplum-Spark Connector external tables:
Identify the name of the Greenplum database(s) that clients were loading from when the Spark cluster shut down.
Ensure that no Spark applications or shells are actively using the Greenplum-Spark Connector.
Log in to the Greenplum Database master node as the
$ ssh gpadmin@<gpmaster> gpadmin@gpmaster$
For each Greenplum database you identified in Step 1:
psqlsubsystem, providing the database name with the
gpadmin@gpmaster$ psql -d <dbname>
Locate the Greenplum Database external tables in the database. Use the
psqlmeta-command for Greenplum Database 4.3.x, or the
\dEmeta-command for Greenplum Database 5.0. The naming format of Greenplum Database external tables created by the Greenplum-Spark Connector is
spark_<app-specific-id>_<spark-node>_<num>. For example, to list the external tables in the schema named
dbname=# \dx faa.* List of relations Schema | Name | Type | Owner | Storage --------+----------------------------------------------+-------+---------+---------- faa | spark_a820885ffab85964_80ac3fefffd80ce4_1_41 | table | bill | external
Drop the orphaned external tables. Be sure to schema-qualify the table names. For example:
dbname=# DROP EXTERNAL TABLE faa.spark_a820885ffab85964_80ac3fefffd80ce4_1_41 CASCADE;
Refer to the Greenplum Database documentation for additional information about the
DROP EXTERNAL TABLEcommand.