Pivotal Greenplum®-Spark® Connector v1.1


Enabling Greenplum-Spark Connector Logging

Greenplum-Spark Connector logging is governed by the logging configuration defined by the Spark application that is running with the Connector JAR file.

Spark uses log4j for logging. The default Spark log file directory is $SPARK_HOME/logs. The default Spark logging configuration file resides in $SPARK_HOME/conf/. This file specifies the default logging in effect for applications running on that Spark cluster node, including spark-shell. A Spark application may run with its own logging configuration file, and settings in that file may identify an application-specific log file location.

To enable more verbose Greenplum-Spark Connector logging to the console, add the following setting to the logging configuration file in use by the Spark application:

You can also configure a Spark application to write Greenplum-Spark Connector log messages to a separate file. For example, to configure the Greenplum-Spark Connector to log to a file named /tmp/log/greenplum-spark.log, add the following text to your Spark application’s logging configuration file:

    , gscfile
    log4j.appender.gscfile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

Note: When enabling Greenplum-Spark Connector logging, ensure that you create or update the logging configuration file on all Spark driver and executor nodes.
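A complete file-appender configuration could look something like the sketch below, which appends a set of log4j 1.x properties to a logging configuration file. Several names here are assumptions rather than values taken from this guide: the logger package io.pivotal.greenplum.spark, the DEBUG level, and the target file name gsc-log4j-example.properties are illustrative only. The appender name gscfile, the log file path, and the conversion pattern follow the example above.

```shell
# Hypothetical target file; point this at the logging configuration
# file that your Spark application actually uses.
LOG4J_FILE="${LOG4J_FILE:-./gsc-log4j-example.properties}"

# Append a file appender named "gscfile" and route Connector messages to it.
cat >> "$LOG4J_FILE" <<'EOF'
# ASSUMPTION: Connector classes log under this package name; verify before use.
log4j.logger.io.pivotal.greenplum.spark=DEBUG, gscfile
log4j.appender.gscfile=org.apache.log4j.FileAppender
log4j.appender.gscfile.File=/tmp/log/greenplum-spark.log
log4j.appender.gscfile.layout=org.apache.log4j.PatternLayout
log4j.appender.gscfile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
EOF
```

Because the snippet appends rather than overwrites, review the resulting file for duplicate entries if you run it more than once, and repeat the change on every driver and executor node.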

Examining Spark Log Files

Spark generates driver and executor log files. Log files associated with the executor processes of a Spark application using the Greenplum-Spark Connector will include information and errors related to data loading and RDD transformations.

Cleaning Up Orphaned Greenplum External Tables

If your Spark cluster unexpectedly shuts down during a Greenplum-Spark job, you may be required to manually clean up Greenplum Database external tables created by the Greenplum-Spark Connector.

Perform the following procedure to locate and delete orphaned Greenplum-Spark Connector external tables:

  1. Identify the name of the Greenplum database(s) that clients were loading from when the Spark cluster shut down.

  2. Ensure that no Spark applications or shells are actively using the Greenplum-Spark Connector.

  3. Log in to the Greenplum Database master node as the gpadmin administrative user:

    $ ssh gpadmin@<gpmaster>

  4. For each Greenplum database you identified in Step 1:

    1. Start the psql subsystem, providing the database name with the -d option:

      gpadmin@gpmaster$ psql -d <dbname>

    2. Locate the Greenplum Database external tables in the database. Use the \dx psql meta-command for Greenplum Database 4.3.x, or the \dE meta-command for Greenplum Database 5.0. The naming format of Greenplum Database external tables created by the Greenplum-Spark Connector is spark_<app-specific-id>_<spark-node>_<num>. For example:

      dbname=# \dx
                                 List of relations
       Schema |               Name               | Type  |  Owner  | Storage  
      --------+----------------------------------+-------+---------+----------
       public | spark_f5689bc8163c32d7_driver_76 | table | gpadmin | external

    3. Drop the orphaned external tables. Be sure to schema-qualify the table names. For example:

      dbname=# DROP EXTERNAL TABLE public.spark_f5689bc8163c32d7_driver_76 CASCADE;

      Refer to the Greenplum Database documentation for additional information about the DROP EXTERNAL TABLE command.

    4. Exit the psql session:

      dbname=# \q
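When a database contains many orphaned tables, the locate and drop steps above can be scripted. The following is a minimal sketch, not a supported utility. It assumes that external tables are marked with relstorage = 'x' in pg_class (true of the Greenplum Database 4.3.x and 5.x catalogs as far as we know) and that every external table whose name begins with spark_ belongs to the Connector. The function names are hypothetical. The script only prints DROP statements so that you can review them before running any.

```shell
# Build a DROP statement for one schema-qualified external table.
drop_stmt() {
  printf 'DROP EXTERNAL TABLE %s.%s CASCADE;\n' "$1" "$2"
}

# List candidate tables in database $1 as "schema|name" lines.
# ASSUMPTION: relstorage = 'x' identifies external tables in pg_class.
list_spark_ext_tables() {
  psql -d "$1" -At -F '|' -c "
    SELECT n.nspname, c.relname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relstorage = 'x' AND c.relname ~ '^spark_';"
}

# Print (do not execute) one DROP statement per candidate table.
print_drop_statements() {
  list_spark_ext_tables "$1" | while IFS='|' read -r schema table; do
    drop_stmt "$schema" "$table"
  done
}
```

Running print_drop_statements <dbname> as gpadmin on the master node would emit statements like the DROP EXTERNAL TABLE example in step 3. Confirm that no listed table belongs to a running application before executing them in psql.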

Common Errors

The Greenplum-Spark Connector uses TCP ports for communication and data transfer between Greenplum Database and Spark cluster nodes. Port-related errors you may encounter include:

Error Message: Address already in use
  Cause: The most likely cause of this error is an unrelated, non-Spark process using a port assigned to a Greenplum Database table load operation. This transient situation prevents the Spark worker from using the port to receive data from Greenplum Database.
  Remedy: Attempt to reload the data. If the reload fails, check whether a long-running service is using the identified port on the Spark worker node. If so, move the offending process to a different port.

Error Message: No more available ports
  Cause: There are two probable causes for this rare error message:
    1. The Spark worker node is running many non-Spark-related processes that are consuming ports.
    2. The user’s long-lived spark-shell session or Spark application is loading a large number (hundreds to thousands) of Greenplum Database tables.
  Remedy: The resolution depends on your Spark deployment and on your application.
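
To check whether a long-running service holds the port named in an "Address already in use" error, you could run something like the following on the affected Spark worker node. This is a sketch that assumes either lsof or ss is installed; port_owner is a hypothetical helper name, and you may need root privileges to see processes owned by other users.

```shell
# Show any process listening on the given TCP port.
port_owner() {
  # Reject empty or non-numeric arguments.
  case "$1" in
    ''|*[!0-9]*) echo "usage: port_owner <port>" >&2; return 2 ;;
  esac
  if command -v lsof >/dev/null 2>&1; then
    # -n/-P skip host and port name lookups; -sTCP:LISTEN shows listeners only.
    lsof -nP -iTCP:"$1" -sTCP:LISTEN
  else
    # Fall back to ss: listening (-l) TCP (-t) sockets, numeric (-n), with process (-p).
    ss -ltnp "sport = :$1"
  fi
}
```

For example, port_owner 12900 lists the listener on port 12900, if there is one. The port number here is arbitrary; use the one reported in the error.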

If your Spark application encounters Java memory errors when using the Greenplum-Spark Connector, consider increasing the partitionsPerSegment read option value. Increasing the number of Spark partitions per Greenplum Database segment decreases the memory requirements per partition.

You may choose to specify --driver-memory and --executor-memory options with your spark-shell or spark-submit command to configure specific driver and executor memory allocations. See spark-shell --help or spark-submit --help and Submitting Applications in the Spark documentation.
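
As an illustration, a spark-submit invocation with explicit memory settings might be assembled as shown below. The memory sizes, JAR paths, and class name are hypothetical placeholders, not recommendations. Only the --driver-memory, --executor-memory, --jars, and --class options are standard spark-submit options.

```shell
# Illustrative values only; size these for your own cluster and workload.
DRIVER_MEM="2g"
EXECUTOR_MEM="4g"

# Hypothetical placeholders for the Connector JAR and your application.
CONNECTOR_JAR="/path/to/greenplum-spark-connector.jar"
APP_JAR="/path/to/my-app.jar"
APP_CLASS="com.example.MyApp"

SUBMIT_CMD="spark-submit --driver-memory $DRIVER_MEM --executor-memory $EXECUTOR_MEM --jars $CONNECTOR_JAR --class $APP_CLASS $APP_JAR"

# Print the command for review instead of running it.
echo "$SUBMIT_CMD"
```

The same two memory options apply to spark-shell.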

For additional information regarding memory considerations in your Spark application, refer to Memory Tuning in the Spark documentation and Apache Spark JIRA SPARK-6235.