Troubleshooting
Enabling Greenplum-Spark Connector Logging
Greenplum-Spark Connector logging is governed by the logging configuration defined by the Spark application that is running with the Connector JAR file. Spark uses log4j for logging. The default Spark log file directory is $SPARK_HOME/logs, and the default Spark logging configuration file is $SPARK_HOME/conf/log4j.properties. This file specifies the default logging in effect for applications running on that Spark cluster node, including spark-shell. A Spark application may run with its own log4j.properties configuration file; settings in that file may identify an application-specific log file location.
To enable more verbose Greenplum-Spark Connector logging to the console, add the following setting to the log4j.properties file in use by the Spark application:
log4j.logger.io.pivotal.greenplum.spark=DEBUG
You can also configure a Spark application to log Greenplum-Spark Connector log messages to a separate file. For example, to configure the Greenplum-Spark Connector to log to a file named /tmp/log/greenplum-spark.log, add the following text to your Spark application's log4j.properties file:
log4j.logger.io.pivotal.greenplum.spark=DEBUG, gscfile
log4j.appender.gscfile=org.apache.log4j.FileAppender
log4j.appender.gscfile.file=/tmp/log/greenplum-spark.log
log4j.appender.gscfile.append=true
log4j.appender.gscfile.layout=org.apache.log4j.PatternLayout
log4j.appender.gscfile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
Note: When you enable Greenplum-Spark Connector logging, ensure that you create or update the log4j.properties file on all Spark driver and executor nodes.
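One common way to ship an application-specific log4j.properties file to the driver and executors is to pass it with spark-submit. This is a sketch only; the properties file name, application class, and JAR name below are placeholders:

$ spark-submit \
    --files my-log4j.properties \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:my-log4j.properties" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:my-log4j.properties" \
    --class com.example.MyApp my-app.jar

The --files option copies the file into each executor's working directory, and the extraJavaOptions settings point the log4j framework at that file.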
Examining Spark Log Files
Spark generates driver and executor log files. Log files associated with the executor processes of a Spark application using the Greenplum-Spark Connector will include information and errors related to data loading and RDD transformations.
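For example, in a Spark standalone deployment, each worker typically writes executor logs beneath its work directory; the application and executor identifiers shown here are placeholders:

$ ls $SPARK_HOME/work/<app-id>/<executor-id>/
stderr  stdout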
Cleaning Up Orphaned Greenplum External Tables
When it loads a Greenplum Database table into Spark, the Greenplum-Spark Connector creates external tables in the database schema you specify in the dbschema option value, or in the public schema if you do not provide a dbschema option. If your Spark cluster unexpectedly shuts down during a Greenplum-Spark job, you may be required to manually clean up these Greenplum Database external tables.
Perform the following procedure to locate and delete orphaned Greenplum-Spark Connector external tables:
1. Identify the name of the Greenplum database(s) that clients were loading from when the Spark cluster shut down.
2. Ensure that no Spark applications or shells are actively using the Greenplum-Spark Connector.
3. Log in to the Greenplum Database master node as the gpadmin administrative user:

   $ ssh gpadmin@<gpmaster>
   gpadmin@gpmaster$
4. For each Greenplum database you identified in Step 1:

   a. Start the psql subsystem, providing the database name with the -d option:

      gpadmin@gpmaster$ psql -d <dbname>

   b. Locate the Greenplum Database external tables in the database. Use the \dx psql meta-command for Greenplum Database 4.3.x, or the \dE meta-command for Greenplum Database 5.0. The naming format of Greenplum Database external tables created by the Greenplum-Spark Connector is spark_<app-specific-id>_<spark-node>_<num>. For example, to list the external tables in the schema named faa (a catalog query that lists these tables across all schemas is sketched after this procedure):

      dbname=# \dx faa.*
                                    List of relations
       Schema |                     Name                     | Type  |  Owner  | Storage
      --------+----------------------------------------------+-------+---------+----------
       faa    | spark_a820885ffab85964_80ac3fefffd80ce4_1_41 | table | bill    | external

   c. Drop the orphaned external tables. Be sure to schema-qualify the table names. For example:

      dbname=# DROP EXTERNAL TABLE faa.spark_a820885ffab85964_80ac3fefffd80ce4_1_41 CASCADE;

      Refer to the Greenplum Database documentation for additional information about the DROP EXTERNAL TABLE command.

   d. Exit the psql session:

      dbname=# \q
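If you would rather not inspect each schema individually, the following query is one way to list all Connector-created external tables in the current database. This is a sketch that assumes the spark_ name prefix shown above and the Greenplum Database pg_class.relstorage value 'x' for external tables:

dbname=# SELECT n.nspname AS schema, c.relname AS name
         FROM pg_class c
         JOIN pg_namespace n ON n.oid = c.relnamespace
         WHERE c.relname LIKE 'spark_%' AND c.relstorage = 'x';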
Common Errors
Port Errors
The Greenplum-Spark Connector utilizes TCP ports for Greenplum Database to Spark cluster node communications and data transfer. Port-related errors you may encounter include:
| Error Message | Discussion |
|---|---|
| java.lang.RuntimeException: <port-number> is not a valid port number. | Cause: The most likely cause of this error is that a port number you specified in connector.port is outside the valid range (operating system-specific, but typically [1024-65535]). Remedy: Specify connector.port value(s) that are within the supported range. |
| java.lang.RuntimeException: Unable to start GpfdistService on any of ports=<list-of-port-numbers> | Cause: The most likely cause of this error is that the port number(s) you specified in connector.port are already in use. This situation prevents the Spark worker from using the port to receive data from Greenplum Database. Remedy: Try specifying a different set of port numbers in connector.port. |
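For reference, a minimal sketch of pinning the Connector to a specific port from spark-shell follows. The data source class name, JDBC URL, credentials, table name, and partition column are assumptions for illustration, and the port value is arbitrary; substitute your own values:

val gpdf = spark.read
  .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")   // assumed Connector data source name
  .option("url", "jdbc:postgresql://gpmaster.example.com/tutorial") // placeholder connection URL
  .option("user", "user1")                                          // placeholder credentials
  .option("password", "changeme")
  .option("dbtable", "otp_c")                                       // placeholder table name
  .option("partitionColumn", "airlineid")                           // placeholder partition column
  .option("connector.port", "12900")                                // bind the Connector's data transfer service to this port
  .load()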
Memory Errors
If your Spark application encounters Java memory errors when using the Greenplum-Spark Connector, consider increasing the partitionsPerSegment read option value. Increasing the number of Spark partitions per Greenplum Database segment decreases the memory requirements per partition.
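For example, adding the following option to a Connector reader (such as the sketch under Port Errors) splits each segment's rows across more, smaller Spark partitions; the value 4 is illustrative only:

  .option("partitionsPerSegment", "4")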
You may choose to specify the --driver-memory and --executor-memory options with your spark-shell or spark-submit command to configure specific driver and executor memory allocations. See spark-shell --help or spark-submit --help, and Submitting Applications in the Spark documentation.
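For example, the following illustrative invocation allocates 2g to the driver and 4g to each executor; the memory values and the Connector JAR path are placeholders:

$ spark-shell --driver-memory 2g --executor-memory 4g --jars <path-to-greenplum-spark-connector-jar>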
For additional information regarding memory considerations in your Spark application, refer to Memory Tuning in the Spark documentation and Apache Spark JIRA SPARK-6235.
Greenplum Database Connection Errors
A Spark application may encounter “connection limit exceeded” errors when the number of open connections to the Greenplum Database server approaches its configured maximum limit (max_connections).
The Greenplum Database pg_stat_activity view provides information about current database activity. To help troubleshoot connection-related errors, run the following Greenplum Database commands and queries to determine the number and source of open connections to Greenplum Database.
Display the max_connections setting for the Greenplum Database server:
postgres=# show max_connections;
 max_connections
-----------------
 250
(1 row)
Display the number of open connections to the Greenplum Database server:
postgres=# SELECT count(*) FROM pg_stat_activity;
View the number of connections to a specific database or from a specific user:
postgres=# SELECT count(*) FROM pg_stat_activity WHERE datname='tutorial';
postgres=# SELECT count(*) FROM pg_stat_activity WHERE usename='user1';
Display idle and active query counts in the Greenplum Database cluster:
postgres=# SELECT count(*) FROM pg_stat_activity WHERE current_query='<IDLE>';
postgres=# SELECT count(*) FROM pg_stat_activity WHERE current_query!='<IDLE>';
View the database name, user name, client address, client port, and current query for each open connection to the Greenplum Database server:
postgres=# SELECT datname, usename, client_addr, client_port, current_query FROM pg_stat_activity;
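To see which databases and roles hold the most connections, you can also aggregate the view; for example:
postgres=# SELECT datname, usename, count(*) FROM pg_stat_activity GROUP BY datname, usename ORDER BY count(*) DESC;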
If you identify a Spark application using the Greenplum-Spark Connector as the source of too many open connections, adjust the connection pool configuration options appropriately. Refer to JDBC Connection Pooling for additional information about connection pool configuration options.