Overview of the Greenplum-Spark Connector
Pivotal Greenplum Database is a massively parallel processing database server specially designed to manage large scale analytic data warehouses and business intelligence workloads. Apache Spark is a fast, general-purpose computing system for distributed, large-scale data processing.
The Pivotal Greenplum-Spark Connector provides high speed, parallel data transfer between Greenplum Database and Apache Spark clusters to support:
- Interactive data analysis
- In-memory analytics processing
- Batch ETL
- Continuous ETL pipeline (streaming)
A Spark application consists of a driver program and executor processes running on worker nodes in your Spark cluster. When an application uses the Greenplum-Spark Connector to load a Greenplum Database table into Spark, the driver program initiates communication with the Greenplum Database master node via JDBC to request metadata information. This information helps the Connector determine where the table data is stored in Greenplum Database, and how to efficiently divide the data/work among the available Spark worker nodes.
Greenplum Database stores table data across segments. A Spark application using the Greenplum-Spark Connector to load a Greenplum Database table identifies a specific table column as a partition column. The Connector uses the data values in this column to assign specific table data rows on each Greenplum Database segment to one or more Spark partitions.
Within a Spark worker node, each application launches its own executor process. The executor of an application using the Greenplum-Spark Connector spawns a task for each Spark partition. The task communicates with the Greenplum Database master via JDBC to create and populate an external table with the data rows managed by its Spark partition. Each Greenplum Database segment then transfers this table data via HTTP directly to its Spark task. This communication occurs across all segments in parallel.
Figure: Greenplum-Spark Connector Read Architecture