Follow these guidelines to ensure that the Greenplum-Spark Connector works and performs optimally in your environment.
Before installing and using the Greenplum-Spark Connector, ensure that you meet the following prerequisites:
- You have administrative access to a running Greenplum Database cluster.
- You have access to a running Spark cluster.
- Network connectivity exists between the Greenplum Database master node and the Spark driver, and between the master node and every Spark worker node.
- Network connectivity exists between every Spark worker node and every Greenplum Database segment host.
Refer to the Hardware Provisioning Memory discussion in the Spark documentation for Spark cluster node memory configuration considerations.
The Greenplum Database master host port number (<port-num>) is configurable; the default is 5432. The Greenplum-Spark Connector uses this port for communication from the Spark driver and the Spark worker nodes to the Greenplum Database master. Ensure that TCP port <port-num> on the Greenplum Database master host is open and accessible to the Spark driver and all Spark worker nodes.
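One way to confirm that the master port is reachable is to attempt a plain TCP connection from the Spark driver host and from each Spark worker node. The following is a minimal sketch using only the Python standard library. The function name `can_connect` and the host name in the usage comment are hypothetical and are not part of the Connector. A successful connection shows only that the port is open at the network level, not that database authentication will succeed.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and opens a TCP connection;
        # the context manager closes the socket as soon as the check completes.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (hypothetical host name; substitute your own master host and <port-num>):
#   if not can_connect("gpmaster.example.com", 5432):
#       print("Greenplum master port is not reachable from this node")
```

Run the check on the Spark driver host and on every Spark worker node, because each of them must be able to reach the master port.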
The Greenplum-Spark Connector uses TCP connections to transfer data between Greenplum Database segment hosts and Spark worker nodes. By default, the Connector defers port number selection to the operating system. Alternatively, you can configure the port numbers that the Connector uses for data transfer. Refer to Configuring Spark Worker Port Numbers for information about this configuration procedure.
Ensure that every Greenplum Database segment host can reach the Spark worker nodes on all ports in the range 1024-65535 or, if you configured specific port numbers, on those ports.
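If you restrict data transfer to specific ports, it can help to expand the port list before you review firewall rules, and to confirm that every entry falls within 1024-65535. The sketch below assumes a hypothetical text format of comma-separated port numbers and `low-high` ranges. That format, the function name `expand_port_spec`, and the constants are illustrative assumptions and do not describe the Connector's own configuration syntax.

```python
from typing import List

# Port range stated in this section for Spark worker data-transfer ports.
MIN_PORT = 1024
MAX_PORT = 65535


def expand_port_spec(spec: str) -> List[int]:
    """Expand a spec such as "12000-12002, 12010" into a sorted list of ports.

    The spec format is a hypothetical convention for this example.
    Raises ValueError if an entry is malformed or outside MIN_PORT..MAX_PORT.
    """
    ports = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in port spec: {spec!r}")
        # "12000-12002" yields a range; "12010" yields a single port.
        low_text, sep, high_text = part.partition("-")
        low = int(low_text)  # int() raises ValueError on non-numeric text
        high = int(high_text) if sep else low
        if low > high:
            raise ValueError(f"range start exceeds range end: {part!r}")
        if low < MIN_PORT or high > MAX_PORT:
            raise ValueError(f"port outside {MIN_PORT}-{MAX_PORT}: {part!r}")
        ports.update(range(low, high + 1))
    return sorted(ports)


# Example usage:
#   expand_port_spec("12000-12002, 12010")
```

The resulting list names the ports that must accept connections from every Greenplum Database segment host.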